# Hallucination Detection
Detect hallucinations, unsupported claims, and contradictions in LLM outputs. 5 context-grounded metrics with optional NLI and LLM augmentation.
- 5 metrics: `faithfulness`, `claim_support`, `factual_consistency`, `contradiction_detection`, `hallucination_score`
- Word-overlap heuristic by default. Install `ai-evaluation[nli]` for DeBERTa-based NLI scoring.
- Pass `augment=True` plus a model to refine results with an LLM
These metrics check whether LLM outputs stay faithful to the provided context. All return a continuous score between 0.0 and 1.0.
```python
from fi.evals import evaluate

result = evaluate(
    "faithfulness",
    output="The Eiffel Tower is 330 metres tall and located in Berlin.",
    context="The Eiffel Tower is a wrought-iron lattice tower in Paris, France. It is 330 metres tall.",
)
print(result.score)   # 0.5 (partial hallucination: Berlin is wrong)
print(result.passed)  # False
```
## Metrics
| Metric | What it checks | Supports augment? |
|---|---|---|
| `faithfulness` | Every claim in the output is supported by context | Yes |
| `claim_support` | Individual claims are supported by context | Yes |
| `factual_consistency` | Stated facts align with reference material | Yes |
| `contradiction_detection` | Output directly contradicts the context | No |
| `hallucination_score` | Overall hallucination combining multiple signals | Yes |
### `faithfulness`

Whether the output is faithful to the provided context. Every claim in the output is checked against the context.

```python
result = evaluate(
    "faithfulness",
    output="Python was created by Guido van Rossum and first released in 1991.",
    context="Python is a programming language created by Guido van Rossum, first released on February 20, 1991.",
)
# score → 1.0 (fully faithful)
```
### `claim_support`

Whether individual claims are supported by context. Useful when you need claim-level granularity.

```python
result = evaluate(
    "claim_support",
    output="Mars is the fourth planet from the Sun and has three moons.",
    context="Mars is the fourth planet from the Sun. It has two natural moons, Phobos and Deimos.",
)
# score → 0.5 (one claim supported, one not)
```
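Claim-level scoring can be sketched in plain Python. The sketch below is a toy illustration under simplifying assumptions, not the library's actual algorithm: split the output into claims, score each claim by word overlap with the context, then average.

```python
import re

def split_claims(text: str) -> list[str]:
    # Naive claim splitter: break on sentence boundaries and "and" conjunctions.
    parts = re.split(r"(?:\.\s+|\s+and\s+)", text.strip(". "))
    return [p.strip() for p in parts if p.strip()]

def overlap(claim: str, context: str) -> float:
    # Fraction of distinct claim words that appear anywhere in the context.
    def words(s: str) -> set[str]:
        return set(re.findall(r"[a-z0-9]+", s.lower()))
    claim_words = words(claim)
    return len(claim_words & words(context)) / len(claim_words) if claim_words else 0.0

output = "Mars is the fourth planet from the Sun and has three moons."
context = "Mars is the fourth planet from the Sun. It has two natural moons, Phobos and Deimos."
scores = [overlap(c, context) for c in split_claims(output)]
print([round(s, 2) for s in scores])  # [1.0, 0.67]
```

Note that the unsupported "has three moons" claim still picks up partial word overlap (both "has" and "moons" appear in the context), which is exactly why verifying whole claims, and the NLI/LLM options described below, improve on raw overlap.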
### `factual_consistency`

Whether stated facts align with the reference material. Focuses on consistency, not completeness.

```python
result = evaluate(
    "factual_consistency",
    output="The company reported revenue of $5.2 billion in Q3 2024.",
    context="In Q3 2024, the company posted revenue of $5.2 billion, up 12% year-over-year.",
)
# score → 1.0 (factually consistent)
```
### `contradiction_detection`

Whether the output directly contradicts the context.

```python
result = evaluate(
    "contradiction_detection",
    output="The patient's blood pressure decreased after the medication.",
    context="After administering the medication, the patient's blood pressure increased from 120/80 to 145/95.",
)
# score → 0.0 (contradiction detected)
```
### `hallucination_score`

Overall hallucination score combining multiple detection signals.

```python
result = evaluate(
    "hallucination_score",
    output="The Great Wall of China is visible from space and was built in a single dynasty.",
    context="The Great Wall of China was built over many centuries by multiple dynasties. It is not visible from space with the naked eye.",
)
# score → 0.0 (highly hallucinated)
```
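One hedged way to picture "combining multiple signals" (the library's exact weighting is internal and may differ): blend a support signal with a contradiction signal, so that either missing support or an outright contradiction drags the score toward 0.

```python
def combined_score(support: float, contradiction: float) -> float:
    # Hypothetical blend: start from the support signal, penalise
    # contradictions multiplicatively, and clamp to [0, 1].
    return max(0.0, min(1.0, support * (1.0 - contradiction)))

# Strong support, no contradiction -> high score.
print(combined_score(0.9, 0.0))  # 0.9
# Decent word overlap but contradicted by the context -> score near 0.
print(combined_score(0.8, 0.9))
```

The multiplicative form makes the two signals independently decisive: a fully contradicted output scores 0 no matter how much surface overlap it has.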
## LLM Augmentation

By default, metrics use a word-overlap heuristic. For higher accuracy, pass `augment=True` together with a `model` parameter. This runs the heuristic first, then refines the result with an LLM.

```python
from fi.evals import evaluate

result = evaluate(
    "faithfulness",
    output="Tesla was founded by Elon Musk in 2003.",
    context="Tesla, Inc. was incorporated in July 2003 by Martin Eberhard and Marc Tarpenning. Elon Musk joined as chairman in 2004.",
    augment=True,
    model="gemini/gemini-2.5-flash",
)
print(result.score)               # 0.25 (catches the misattribution)
print(result.metadata["engine"])  # "local+llm"
```
The heuristic alone might miss subtle misattributions. Augmentation catches that “founded by Elon Musk” is not supported by the context.
Supported on: `faithfulness`, `claim_support`, `factual_consistency`, `hallucination_score`.
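To see why the default heuristic needs help here, a minimal word-overlap scorer (a toy stand-in for the library's internal heuristic, not its actual code) rates the Tesla sentence as mostly supported: nearly every word appears in the context even though the founding claim is wrong.

```python
import re

def word_overlap_score(output: str, context: str) -> float:
    # Fraction of distinct output words that also appear in the context.
    def words(s: str) -> set[str]:
        return set(re.findall(r"[a-z0-9]+", s.lower()))
    out = words(output)
    return len(out & words(context)) / len(out) if out else 0.0

output = "Tesla was founded by Elon Musk in 2003."
context = ("Tesla, Inc. was incorporated in July 2003 by Martin Eberhard "
           "and Marc Tarpenning. Elon Musk joined as chairman in 2004.")
print(word_overlap_score(output, context))  # 0.875: only "founded" is missing
```

Seven of the eight distinct output words appear in the context, so a purely lexical check scores the sentence highly; only an entailment-aware check (NLI or LLM) sees that the context attributes the founding to different people.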
## NLI-Based Detection

For the most accurate detection without LLM calls, install the NLI dependency:

```bash
pip install ai-evaluation[nli]
```

This enables DeBERTa-based natural language inference. Once installed, it is used automatically, with no code changes.
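As a rough mental model (the actual mapping is internal to the library and may differ), an NLI model scores each output-context pair with entailment / neutral / contradiction probabilities, which can then be collapsed into a single support score, for example:

```python
def nli_support(probs: dict[str, float]) -> float:
    # Illustrative mapping, not the library's actual formula:
    # entailment counts fully, neutral half, contradiction not at all.
    return probs["entailment"] + 0.5 * probs["neutral"]

# A clearly contradicted claim gets a low support score.
print(nli_support({"entailment": 0.05, "neutral": 0.10, "contradiction": 0.85}))
```

Unlike word overlap, this rewards semantic entailment rather than shared vocabulary, which is what lets NLI catch contradictions and misattributions that lexical heuristics miss.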