Hallucination Detection

Detect hallucinations, unsupported claims, and contradictions in LLM outputs. Five context-grounded metrics, with optional NLI scoring and LLM augmentation.

TL;DR
  • 5 metrics: faithfulness, claim_support, factual_consistency, contradiction_detection, hallucination_score
  • Word-overlap heuristic by default. Install ai-evaluation[nli] for DeBERTa-based NLI scoring.
  • Pass augment=True and a model to refine results with an LLM

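These metrics check whether LLM outputs stay faithful to the provided context. All return a continuous score between 0.0 and 1.0. In the examples on this page, 1.0 means fully grounded and 0.0 means unsupported or contradicted; this also holds for contradiction_detection and hallucination_score, where a low score signals a problem.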

from fi.evals import evaluate

result = evaluate(
    "faithfulness",
    output="The Eiffel Tower is 330 metres tall and located in Berlin.",
    context="The Eiffel Tower is a wrought-iron lattice tower in Paris, France. It is 330 metres tall.",
)

print(result.score)   # 0.5 (partial hallucination — Berlin is wrong)
print(result.passed)  # False

Metrics

| Metric | What it checks | Supports augment? |
| --- | --- | --- |
| faithfulness | Every claim in the output is supported by context | Yes |
| claim_support | Individual claims are supported by context | Yes |
| factual_consistency | Stated facts align with reference material | Yes |
| contradiction_detection | Output directly contradicts the context | No |
| hallucination_score | Overall hallucination combining multiple signals | Yes |
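
To compare the five metrics on the same output/context pair, a small wrapper can loop over the metric names listed above. This is a minimal sketch: `score_all` and its `evaluate_fn` parameter are hypothetical helpers, not part of the library. It assumes only what the examples on this page show, namely that `evaluate(name, output=..., context=...)` returns an object with a `.score` attribute.

```python
from typing import Callable, Dict, Iterable

# Metric names as listed in the metrics table on this page.
METRICS = (
    "faithfulness",
    "claim_support",
    "factual_consistency",
    "contradiction_detection",
    "hallucination_score",
)


def score_all(
    evaluate_fn: Callable,
    output: str,
    context: str,
    metrics: Iterable[str] = METRICS,
) -> Dict[str, float]:
    """Run each metric on one output/context pair and collect the scores.

    `evaluate_fn` (hypothetical parameter) is expected to behave like
    `fi.evals.evaluate` as used on this page: it is called as
    evaluate_fn(name, output=..., context=...) and returns an object
    with a `.score` attribute.
    """
    return {
        name: evaluate_fn(name, output=output, context=context).score
        for name in metrics
    }
```

With the library installed, this would be called as `score_all(evaluate, output=..., context=...)` after `from fi.evals import evaluate`. Passing the function in as an argument keeps the helper easy to exercise with a stub in tests.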

faithfulness

Whether the output is faithful to the provided context. Every claim in the output is checked against the context.

result = evaluate(
    "faithfulness",
    output="Python was created by Guido van Rossum and first released in 1991.",
    context="Python is a programming language created by Guido van Rossum, first released on February 20, 1991.",
)
# score → 1.0 (fully faithful)

claim_support

Whether individual claims are supported by context. Useful when you need claim-level granularity.

result = evaluate(
    "claim_support",
    output="Mars is the fourth planet from the Sun and has three moons.",
    context="Mars is the fourth planet from the Sun. It has two natural moons, Phobos and Deimos.",
)
# score → 0.5 (one claim supported, one not)

factual_consistency

Whether stated facts align with the reference material. Focuses on consistency, not completeness.

result = evaluate(
    "factual_consistency",
    output="The company reported revenue of $5.2 billion in Q3 2024.",
    context="In Q3 2024, the company posted revenue of $5.2 billion, up 12% year-over-year.",
)
# score → 1.0 (factually consistent)

contradiction_detection

Whether the output directly contradicts the context.

result = evaluate(
    "contradiction_detection",
    output="The patient's blood pressure decreased after the medication.",
    context="After administering the medication, the patient's blood pressure increased from 120/80 to 145/95.",
)
# score → 0.0 (contradiction detected)

hallucination_score

Overall hallucination score combining multiple detection signals.

result = evaluate(
    "hallucination_score",
    output="The Great Wall of China is visible from space and was built in a single dynasty.",
    context="The Great Wall of China was built over many centuries by multiple dynasties. It is not visible from space with the naked eye.",
)
# score → 0.0 (highly hallucinated)

LLM Augmentation

By default, metrics use a word-overlap heuristic. For higher accuracy, pass augment=True with a model parameter. This runs the heuristic first, then refines the result with an LLM.

from fi.evals import evaluate

result = evaluate(
    "faithfulness",
    output="Tesla was founded by Elon Musk in 2003.",
    context="Tesla, Inc. was incorporated in July 2003 by Martin Eberhard and Marc Tarpenning. Elon Musk joined as chairman in 2004.",
    augment=True,
    model="gemini/gemini-2.5-flash",
)

print(result.score)             # 0.25 (catches the misattribution)
print(result.metadata["engine"])  # "local+llm"

The heuristic alone might miss subtle misattributions. Augmentation catches that “founded by Elon Musk” is not supported by the context.
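
To see why word overlap can miss this kind of error, here is a toy overlap score applied to the Tesla example. It is an illustration of the general idea only: `token_overlap` is a hypothetical function written for this page, and the library's actual heuristic is not documented here and may differ.

```python
import re


def token_overlap(output: str, context: str) -> float:
    """Fraction of distinct output tokens that also appear in the context."""

    def tokenize(text: str) -> set:
        # Lowercase, then keep runs of letters and digits as tokens.
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    out_tokens = tokenize(output)
    if not out_tokens:
        return 0.0
    return len(out_tokens & tokenize(context)) / len(out_tokens)


output = "Tesla was founded by Elon Musk in 2003."
context = (
    "Tesla, Inc. was incorporated in July 2003 by Martin Eberhard and "
    "Marc Tarpenning. Elon Musk joined as chairman in 2004."
)

# Only "founded" is missing from the context, so 7 of 8 tokens match.
print(token_overlap(output, context))  # 0.875
```

Every name and date in the output appears somewhere in the context, so a pure overlap score stays high even though the claim attributes the founding to the wrong person. Judging who did what requires reading the sentence as a whole, which is what an LLM or NLI model adds.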

Supported on: faithfulness, claim_support, factual_consistency, hallucination_score.
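
When the same code path runs several metrics, one way to respect this list is to add the augmentation arguments only for metrics that support them. The sketch below is an assumption-laden helper, not library code: `AUGMENTABLE` and `build_eval_kwargs` are hypothetical names, and the keyword arguments mirror the `evaluate(...)` calls shown on this page.

```python
from typing import Any, Dict, Optional

# Metrics this page lists as supporting augment=True.
AUGMENTABLE = {
    "faithfulness",
    "claim_support",
    "factual_consistency",
    "hallucination_score",
}


def build_eval_kwargs(
    metric: str,
    output: str,
    context: str,
    model: Optional[str] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for evaluate(), adding augmentation if supported."""
    kwargs: Dict[str, Any] = {"output": output, "context": context}
    # Only request LLM refinement when a model is given and the metric allows it.
    if model is not None and metric in AUGMENTABLE:
        kwargs["augment"] = True
        kwargs["model"] = model
    return kwargs
```

A call would then look like `evaluate(metric, **build_eval_kwargs(metric, output, context, model="gemini/gemini-2.5-flash"))`; for contradiction_detection the helper leaves `augment` and `model` out.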

NLI-Based Detection

For the most accurate detection without LLM calls, install the NLI dependency:

pip install "ai-evaluation[nli]"

This enables DeBERTa-based natural language inference. Once installed, it is used automatically; no code changes are needed.
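
For readers new to NLI: a natural language inference model takes a premise (here, the context) and a hypothesis (a claim from the output) and assigns probabilities to three labels: entailment, neutral, and contradiction. The sketch below shows one plausible way such probabilities could be turned into a 0 to 1 support score. The function names, the scoring rule, and the example probabilities are all hypothetical; how the library actually uses the model's output is not described on this page.

```python
from typing import Dict


def support_from_nli(probs: Dict[str, float]) -> float:
    """Map NLI label probabilities for one claim to a 0-1 support score.

    Assumed rule (illustrative only): the score is the normalized
    entailment probability, so neutral and contradicted claims both score low.
    """
    total = sum(probs.values())
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    return probs.get("entailment", 0.0) / total


def is_contradicted(probs: Dict[str, float]) -> bool:
    """True when contradiction is the most likely label."""
    return max(probs, key=probs.get) == "contradiction"


# Hypothetical probabilities for a claim that conflicts with its context;
# these numbers are made up for illustration, not real model output.
example = {"entailment": 0.25, "neutral": 0.25, "contradiction": 0.5}
print(support_from_nli(example))  # 0.25
print(is_contradicted(example))   # True
```

Unlike word overlap, this kind of signal separates "not mentioned" (neutral) from "stated the opposite" (contradiction).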
