Hallucination Detection with Faithfulness & Groundedness
Score RAG outputs for faithfulness and groundedness to catch hallucinations before they reach users.
Catch LLM hallucinations using two complementary metrics: faithfulness (local NLI, catches contradictions) and groundedness (Turing model, catches unsourced claims) — then combine both in a single evaluate() call.
| Time | Difficulty | Package |
|---|---|---|
| 10 min | Beginner | ai-evaluation |
Prerequisites
- FutureAGI account → app.futureagi.com
- API keys: FI_API_KEY and FI_SECRET_KEY (see Get your API keys)
- Python 3.9+
Install
```shell
pip install 'ai-evaluation[nli]'
```
The [nli] extra installs the local NLI model used by faithfulness and contradiction_detection. Without it, these metrics fall back to a less accurate word-overlap heuristic.
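To see why the fallback is less accurate, consider what a word-overlap heuristic can and cannot do. The sketch below is an illustration of the general idea, not the package's actual fallback implementation: it scores the fraction of output words that also appear in the context, which means it is blind to negation and word order.

```python
# Rough sketch of a word-overlap heuristic (NOT the package's actual
# fallback implementation -- just an illustration of why it is less
# accurate than NLI: it ignores negation and word order entirely).
def word_overlap_score(output: str, context: str) -> float:
    output_words = set(output.lower().split())
    context_words = set(context.lower().split())
    if not output_words:
        return 0.0
    return len(output_words & context_words) / len(output_words)

# Identical wording scores 1.0 -- but so would a negated restatement
# built from the same words, which an NLI model would catch.
print(word_overlap_score("JWST launched in 2021", "JWST launched in 2021"))  # 1.0
```

This is exactly the failure mode the NLI model avoids, which is why installing the [nli] extra is recommended.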
```shell
export FI_API_KEY="your-api-key"
export FI_SECRET_KEY="your-secret-key"
```
Metrics for hallucination detection
Three built-in metrics cover hallucination detection. Local NLI metrics run on your machine with no API key; Turing metrics use FutureAGI’s purpose-built evaluation models.
| Metric | Engine | Required inputs | Output | What it catches |
|---|---|---|---|---|
| faithfulness | Local NLI | output, context | Score 0–1 | Contradictions between output and context |
| groundedness | Turing or local | output, input, context | Pass/Fail | Output claims not traceable to context |
| context_adherence | Turing | output, context | Score 0–1 | How strictly the output stays within context boundaries |
Tutorial
Score faithfulness on a single response
This step checks whether the LLM response is consistent with the retrieved context — no contradictions allowed.
```python
from fi.evals import evaluate

context = (
    "The James Webb Space Telescope (JWST) was launched on December 25, 2021. "
    "It orbits the Sun at the second Lagrange point (L2), approximately 1.5 million "
    "kilometers from Earth. JWST observes primarily in the infrared spectrum."
)
question = "When was the James Webb Space Telescope launched and where does it orbit?"

# A response that faithfully reflects the context
response = (
    "The James Webb Space Telescope was launched on December 25, 2021. "
    "It orbits the Sun at the L2 Lagrange point, about 1.5 million kilometers from Earth."
)

result = evaluate(
    "faithfulness",
    output=response,
    context=context,
    input=question,
)

print(f"Faithfulness score : {result.score:.2f}")
print(f"Passed : {result.passed}")
print(f"Reason : {result.reason}")
```

Expected output:
```
Faithfulness score : 1.00
Passed : True
Reason : 2/2 claims supported.
```

Now test a hallucinated response:
```python
from fi.evals import evaluate

context = (
    "The James Webb Space Telescope (JWST) was launched on December 25, 2021. "
    "It orbits the Sun at the second Lagrange point (L2), approximately 1.5 million "
    "kilometers from Earth. JWST observes primarily in the infrared spectrum."
)
question = "When was the James Webb Space Telescope launched and where does it orbit?"

# A response that contradicts the context on both facts
hallucinated_response = (
    "The James Webb Space Telescope was launched on March 10, 2022. "
    "It orbits Earth at an altitude of 600 kilometers."
)

result = evaluate(
    "faithfulness",
    output=hallucinated_response,
    context=context,
    input=question,
)

print(f"Faithfulness score : {result.score:.2f}")
print(f"Passed : {result.passed}")
print(f"Reason : {result.reason}")
```

Expected output:
```
Faithfulness score : 0.00
Passed : False
Reason : 0/2 claims supported.
```

Check groundedness with the Turing engine
groundedness checks whether every claim in the output is traceable to the provided context. Unlike faithfulness (which flags direct contradictions), groundedness also catches plausible-sounding additions the model makes that have no basis in the context.
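The distinction can be made concrete with a toy verdict table. Assuming each output claim receives one of three NLI-style labels (entailment, contradiction, or neutral), faithfulness only penalizes contradictions, while groundedness also fails neutral claims the context says nothing about. This is an illustrative sketch, not the library's internals:

```python
# Toy illustration (NOT the library's internals): given per-claim NLI
# labels, faithfulness penalizes contradictions, while groundedness also
# fails claims the context simply says nothing about ("neutral").
def verdicts(claim_labels: list[str]) -> dict[str, bool]:
    return {
        # faithful as long as nothing in the output contradicts the context
        "faithful": all(label != "contradiction" for label in claim_labels),
        # grounded only if every claim is actually entailed by the context
        "grounded": all(label == "entailment" for label in claim_labels),
    }

# "Serviced every year by astronauts" is not contradicted by the JWST
# context -- it just has no basis in it, so only groundedness fails.
print(verdicts(["entailment", "entailment", "neutral"]))
# {'faithful': True, 'grounded': False}
```

This is why running both metrics together gives broader coverage than either one alone.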
Test a response that adds unsourced facts:
```python
from fi.evals import evaluate

context = (
    "The James Webb Space Telescope (JWST) was launched on December 25, 2021. "
    "It orbits the Sun at the second Lagrange point (L2), approximately 1.5 million "
    "kilometers from Earth. JWST observes primarily in the infrared spectrum."
)
question = "When was the James Webb Space Telescope launched and where does it orbit?"

# A response that adds facts not present in the context
ungrounded_response = (
    "The James Webb Space Telescope was launched on December 25, 2021. "
    "It orbits the Sun at L2, 1.5 million kilometers from Earth. "
    "It is serviced every year by astronauts in low Earth orbit."
)

result = evaluate(
    "groundedness",
    output=ungrounded_response,
    context=context,
    input=question,
    model="turing_small",
)

print(f"Passed : {result.passed}")
print(f"Reason : {result.reason}")
```

Sample output (will vary):
```
Passed : False
Reason : The response includes a claim that is not supported by the provided context.
```

Note
groundedness is model-based, so exact wording and pass/fail can vary. Use the result reason to identify unsupported claims and tune your prompt/retrieval pipeline.
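Beyond prompt tuning, a common production pattern is to gate responses on the verdict before they reach users, serving a safe fallback (or triggering a retry) whenever groundedness fails. A minimal sketch, using a hypothetical EvalResult stand-in for the object returned by evaluate() (only the .passed and .reason fields are assumed):

```python
# Minimal gating sketch. EvalResult here is a hypothetical stand-in for
# the object returned by evaluate(); only .passed and .reason are assumed.
from dataclasses import dataclass

@dataclass
class EvalResult:
    passed: bool
    reason: str

def gate(response: str, result: EvalResult, fallback: str) -> str:
    # Serve the response only when the groundedness check passes;
    # otherwise log the unsupported-claim reason and serve a safe fallback.
    if result.passed:
        return response
    print(f"groundedness failed: {result.reason}")
    return fallback

answer = gate(
    "JWST is serviced yearly by astronauts.",
    EvalResult(passed=False, reason="Unsupported claim about servicing."),
    fallback="I couldn't verify that from the available sources.",
)
print(answer)  # I couldn't verify that from the available sources.
```

The logged reason doubles as a signal for which retrieval or prompt changes to try next.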
Now test a clean response that stays within the context:
```python
clean_response = (
    "The James Webb Space Telescope was launched on December 25, 2021 "
    "and orbits the Sun at L2, about 1.5 million kilometers from Earth."
)

result = evaluate(
    "groundedness",
    output=clean_response,
    context=context,
    input=question,
    model="turing_small",
)

print(f"Passed : {result.passed}")
print(f"Reason : {result.reason}")
```

Sample output (will vary):
```
Passed : True
Reason : All claims are traceable to the provided context.
```

Combine both metrics in one evaluate() call
Pass a list of metric names to run faithfulness and groundedness together on a single output. The call returns a BatchResult you can iterate over or index by name.
```python
from fi.evals import evaluate

context = (
    "The Great Barrier Reef is the world's largest coral reef system, located in the "
    "Coral Sea off the coast of Queensland, Australia. It is composed of over 2,900 "
    "individual reefs and 900 islands stretching over 2,300 kilometers."
)
question = "Where is the Great Barrier Reef and how large is it?"

response = (
    "The Great Barrier Reef is located in the Coral Sea off Queensland, Australia. "
    "It spans over 2,300 kilometers and consists of more than 2,900 individual reefs "
    "and 900 islands."
)

results = evaluate(
    ["faithfulness", "groundedness"],
    output=response,
    context=context,
    input=question,
)

# Iterate over both results
for result in results:
    status = "PASS" if result.passed else "FAIL"
    print(f"{result.eval_name:<15} score={result.score:.2f} {status}")
    print(f"  Reason: {result.reason}")
    print()

# Or look up by name directly
faith_result = results.get("faithfulness")
ground_result = results.get("groundedness")
print(f"Both metrics passed: {faith_result.passed and ground_result.passed}")
```

Expected output:
```
faithfulness    score=1.00 PASS
  Reason: 4/4 claims supported.

groundedness    score=1.00 PASS
  Reason: All claims traceable to context.

Both metrics passed: True
```

Tip
faithfulness runs entirely locally via NLI — no API key required. groundedness can also run locally (omit model=) or via Turing models. Use turing_flash for lowest latency, turing_small for a balanced default, or turing_large for highest accuracy.
What you built
You can now detect LLM hallucinations using faithfulness (contradiction detection) and groundedness (unsourced claim detection), individually or combined in a single evaluate call.
- Scored a RAG response for faithfulness (local NLI) to detect contradictions — no API key needed
- Used groundedness (Turing, Pass/Fail) to catch unsourced claims the LLM adds beyond the context
- Combined multiple metrics in a single evaluate([...]) call returning a BatchResult