Evaluation Using Interface

Input:

  • Required Inputs:
    • input: The source material or reference text.
    • output: The generated text to evaluate for hallucinations.

Output:

  • Result: Returns ‘Passed’ if no hallucination is detected, or ‘Failed’ if a hallucination is detected.
  • Reason: A detailed explanation of why the output was classified as containing or not containing hallucinations.

Evaluation Using Python SDK

Click here to learn how to set up evaluation using the Python SDK.

Input:

  • Required Inputs:
    • input: string - The source material or reference text.
    • output: string - The generated text to evaluate for hallucinations.

Output:

  • Result: Returns a list containing ‘Passed’ if no hallucination is detected, or ‘Failed’ if hallucination is detected.
  • Reason: Provides a detailed explanation of the evaluation.

# Assumes `evaluator` has already been created as described in the Python SDK setup linked above.
result = evaluator.evaluate(
    eval_templates="detect_hallucination",
    inputs={
        "input": "Honey never spoils because it has low moisture content and high acidity, creating an environment that resists bacteria and microorganisms. Archaeologists have even found pots of honey in ancient Egyptian tombs that are still perfectly edible.",
        "output": "Honey doesn't spoil because its low moisture and high acidity prevent the growth of bacteria and other microbes."
    },
    model_name="turing_flash"
)

print(result.eval_results[0].metrics[0].value)
print(result.eval_results[0].reason)

Example Output:

['Passed']
The evaluation is 'Passed' because the output accurately reflects information from the input without adding fabricated content.

- The output statement that "honey doesn't spoil because its low moisture and high acidity prevent the growth of bacteria and other microbes" directly corresponds to information in the input about honey having "low moisture content and high acidity, creating an environment that resists bacteria and microorganisms."
- All key facts presented (low moisture, high acidity, preventing microbial growth) are explicitly supported by the input.
- The output is a condensed version of the input information without introducing new claims.
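
Since the result value is printed as a list such as `['Passed']`, downstream code usually needs to reduce it to a single pass/fail decision. The sketch below shows one way to do that; `is_passed` is a hypothetical helper, and accepting a bare string in addition to a list is an assumption made for robustness, not documented SDK behavior.

```python
def is_passed(value) -> bool:
    """Return True only if every label in the result value is 'Passed'.

    Hypothetical helper: `value` is expected to look like ['Passed'] or
    ['Failed'], as in the example output above. A bare string is also
    accepted (an assumption, not documented behavior).
    """
    labels = value if isinstance(value, (list, tuple)) else [value]
    # An empty result is treated as a failure rather than a silent pass.
    return bool(labels) and all(label == "Passed" for label in labels)
```

For example, `is_passed(result.eval_results[0].metrics[0].value)` could gate whether a generated answer is shown to a user or sent back for revision.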

What to Do If You Get Undesired Results

If the content is evaluated as containing hallucinations (Failed) and you want to improve it:

  • Ensure all claims in your output are explicitly supported by the source material
  • Avoid extrapolating or generalizing beyond what is stated in the input
  • Remove any specific details that aren’t mentioned in the source text
  • Use qualifying language (like “may,” “could,” or “suggests”) when necessary
  • Stick to paraphrasing rather than adding new information
  • Double-check numerical values, dates, and proper nouns against the source
  • Consider directly quoting from the source for critical information
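
The advice to double-check numerical values against the source can be partly automated before running the evaluation. The sketch below is a rough pre-check that is not part of the Detect Hallucination eval; `unsupported_numbers` and the `NUMBER` pattern are hypothetical names. It uses exact string matching, so "3,000" and "3000" count as different values.

```python
import re

# Matches integers and simple decimal or thousands-separated numbers,
# e.g. "42", "3.14", "3,000".
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def unsupported_numbers(source: str, generated: str) -> list[str]:
    """List numbers that appear in the generated text but not in the source."""
    source_numbers = set(NUMBER.findall(source))
    return [n for n in NUMBER.findall(generated) if n not in source_numbers]
```

An empty list means every number in the output also appears verbatim in the source; a non-empty list points to values worth reviewing by hand.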

Comparing Detect Hallucination with Similar Evals

  • Factual Accuracy: While Detect Hallucination checks for fabricated information not in the source, Factual Accuracy evaluates the overall factual correctness of content against broader knowledge.
  • Groundedness: Detect Hallucination focuses on the absence of fabricated content, while Groundedness measures how well the output is supported by the source material.
  • Context Adherence: Detect Hallucination identifies made-up information, while Context Adherence evaluates how well the output adheres to the given context.