Hallucination Detection: LLM Output Accuracy Metrics

Detect hallucinations, unsupported claims, and contradictions in LLM outputs. Five context-grounded metrics, with optional NLI scoring and LLM augmentation.

TL;DR

5 metrics: faithfulness, claim_support, factual_consistency, contradiction_detection, hallucination_score
Word-overlap heuristic by default. Install ai-evaluation[nli] for DeBERTa-based NLI scoring.
Pass augment=True and a model to refine results with an LLM.

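These metrics check whether LLM outputs stay faithful to the provided context. All five return a continuous score between 0.0 and 1.0; in the examples on this page, 1.0 means fully grounded and 0.0 means contradicted or heavily hallucinated.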

from fi.evals import evaluate

result = evaluate(
    "faithfulness",
    output="The Eiffel Tower is 330 metres tall and located in Berlin.",
    context="The Eiffel Tower is a wrought-iron lattice tower in Paris, France. It is 330 metres tall.",
)

print(result.score)   # 0.5 (partial hallucination: Berlin is wrong)
print(result.passed)  # False
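
One way to act on `passed` is to gate a generated answer before it reaches the user. The sketch below is illustrative only: `guard_answer` and its `fallback` text are hypothetical names, and the evaluator is passed in as a parameter so the logic can be exercised without the library installed. It relies on nothing beyond the call shape and the `passed` attribute shown above.

```python
from typing import Any, Callable


def guard_answer(
    evaluate_fn: Callable[..., Any],
    answer: str,
    context: str,
    fallback: str = "I could not verify that against the provided sources.",
) -> str:
    """Return `answer` only if the faithfulness check passes, else `fallback`.

    `evaluate_fn` is assumed to behave like `fi.evals.evaluate`:
    called as evaluate_fn(metric_name, output=..., context=...) and
    returning an object with a boolean `passed` attribute.
    """
    result = evaluate_fn("faithfulness", output=answer, context=context)
    return answer if result.passed else fallback
```

In real use, pass `evaluate` imported from `fi.evals` as the first argument. Whether the default pass threshold suits your application is something to check against your own data.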

Metrics

| Metric | What it checks | Supports `augment`? |
|---|---|---|
| `faithfulness` | Every claim in the output is supported by context | Yes |
| `claim_support` | Individual claims are supported by context | Yes |
| `factual_consistency` | Stated facts align with reference material | Yes |
| `contradiction_detection` | Output directly contradicts the context | No |
| `hallucination_score` | Overall hallucination combining multiple signals | Yes |
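
Since every metric takes the same `output` and `context` arguments, it can be convenient to run all five on one pair and compare the scores side by side. The following is a minimal sketch: `score_all` and `METRICS` are hypothetical names, and the evaluator is injected as a parameter so the helper does not depend on the library being installed.

```python
from typing import Any, Callable, Dict

# Metric names exactly as listed in the table above.
METRICS = (
    "faithfulness",
    "claim_support",
    "factual_consistency",
    "contradiction_detection",
    "hallucination_score",
)


def score_all(
    evaluate_fn: Callable[..., Any], output: str, context: str
) -> Dict[str, float]:
    """Run every metric on one output/context pair and collect the scores.

    `evaluate_fn` is assumed to behave like `fi.evals.evaluate` and to
    return an object with a numeric `score` attribute.
    """
    return {
        name: evaluate_fn(name, output=output, context=context).score
        for name in METRICS
    }
```

Calling `score_all(evaluate, output=..., context=...)` with `evaluate` imported from `fi.evals` would return a dict keyed by metric name.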

faithfulness

Whether the output is faithful to the provided context. Every claim in the output is checked against the context.

result = evaluate(
    "faithfulness",
    output="Python was created by Guido van Rossum and first released in 1991.",
    context="Python is a programming language created by Guido van Rossum, first released on February 20, 1991.",
)
# score → 1.0 (fully faithful)

claim_support

Whether individual claims are supported by context. Useful when you need claim-level granularity.

result = evaluate(
    "claim_support",
    output="Mars is the fourth planet from the Sun and has three moons.",
    context="Mars is the fourth planet from the Sun. It has two natural moons, Phobos and Deimos.",
)
# score → 0.5 (one claim supported, one not)

factual_consistency

Whether stated facts align with the reference material. Focuses on consistency, not completeness.

result = evaluate(
    "factual_consistency",
    output="The company reported revenue of $5.2 billion in Q3 2024.",
    context="In Q3 2024, the company posted revenue of $5.2 billion, up 12% year-over-year.",
)
# score → 1.0 (factually consistent)

contradiction_detection

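Whether the output directly contradicts the context. In the example, a score of 0.0 means a contradiction was detected, so a low score is the bad outcome.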

result = evaluate(
    "contradiction_detection",
    output="The patient's blood pressure decreased after the medication.",
    context="After administering the medication, the patient's blood pressure increased from 120/80 to 145/95.",
)
# score → 0.0 (contradiction detected)

hallucination_score

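Overall hallucination score combining multiple detection signals. Despite the name, the example shows 0.0 for a heavily hallucinated output, so higher scores are better here too.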

result = evaluate(
    "hallucination_score",
    output="The Great Wall of China is visible from space and was built in a single dynasty.",
    context="The Great Wall of China was built over many centuries by multiple dynasties. It is not visible from space with the naked eye.",
)
# score → 0.0 (highly hallucinated)

LLM Augmentation

By default, metrics use a word-overlap heuristic. For higher accuracy, pass augment=True with a model parameter. This runs the heuristic first, then refines the result with an LLM.

from fi.evals import evaluate

result = evaluate(
    "faithfulness",
    output="Tesla was founded by Elon Musk in 2003.",
    context="Tesla, Inc. was incorporated in July 2003 by Martin Eberhard and Marc Tarpenning. Elon Musk joined as chairman in 2004.",
    augment=True,
    model="gemini/gemini-2.5-flash",
)

print(result.score)             # 0.25 (catches the misattribution)
print(result.metadata["engine"])  # "local+llm"

The heuristic alone might miss subtle misattributions. Augmentation catches that “founded by Elon Musk” is not supported by the context.

Supported on: faithfulness, claim_support, factual_consistency, hallucination_score.
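
To make the limitation of word overlap concrete, here is a toy scorer applied to the Tesla example. It is an illustration only: the library's actual heuristic is not described on this page, and `content_tokens`, `overlap_score`, and the stopword list are all hypothetical.

```python
import re

# A deliberately tiny stopword list, chosen for this illustration only.
STOPWORDS = {
    "a", "an", "the", "is", "was", "were", "are", "in", "on", "of",
    "by", "and", "to", "it", "as", "has", "had", "from",
}


def content_tokens(text: str) -> set:
    """Lowercase alphanumeric tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def overlap_score(output: str, context: str) -> float:
    """Fraction of the output's content words that also appear in the context."""
    out = content_tokens(output)
    if not out:
        return 1.0  # nothing was claimed, so nothing is unsupported
    return len(out & content_tokens(context)) / len(out)


TESLA_OUTPUT = "Tesla was founded by Elon Musk in 2003."
TESLA_CONTEXT = (
    "Tesla, Inc. was incorporated in July 2003 by Martin Eberhard and "
    "Marc Tarpenning. Elon Musk joined as chairman in 2004."
)
print(overlap_score(TESLA_OUTPUT, TESLA_CONTEXT))  # 0.8
```

Four of the five content words in the output (`tesla`, `elon`, `musk`, `2003`) appear in the context, so this toy scorer returns 0.8 even though the founding claim is unsupported. Overlap measures shared vocabulary, not who did what, which is the gap that augmentation is meant to close.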

NLI-Based Detection

For the most accurate detection without LLM calls, install the NLI dependency:

pip install "ai-evaluation[nli]"

This enables DeBERTa-based natural language inference. Once installed, it is used automatically, with no code changes required.
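
Because the switch is automatic, it can be worth confirming which engine actually produced a score. A minimal sketch, assuming `metadata["engine"]` is populated on every result: this page only shows it for the augmented case, and the value reported when NLI is active is not documented here. `summarize` is a hypothetical helper.

```python
from typing import Any


def summarize(result: Any) -> str:
    """Format the score, pass/fail flag, and engine of an evaluation result.

    Assumes `result` exposes `score`, `passed`, and a `metadata` mapping
    with an "engine" key, as in the augmentation example on this page.
    """
    engine = result.metadata["engine"]
    return f"score={result.score:.2f} passed={result.passed} engine={engine}"
```

Printing `summarize(result)` before and after installing the extra is one way to see whether the engine label changes.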

