Evaluation Using Interface

Input:

  • Required Inputs:
    • input: URL or file path to the image being captioned.
    • output: The caption text to evaluate.

Output:

  • Result: Returns ‘Passed’ if the caption accurately represents what’s in the image without hallucination, or ‘Failed’ if the caption contains hallucinated elements.
  • Reason: A detailed explanation of why the caption was classified as containing or not containing hallucinations.

Evaluation Using Python SDK

Click here to learn how to set up evaluation using the Python SDK.

Input:

  • Required Inputs:
    • input: string - URL or file path to the image being captioned.
    • output: string - The caption text to evaluate.

Output:

  • Result: Returns a list containing ‘Passed’ if the caption accurately represents what’s in the image without hallucination, or ‘Failed’ if the caption contains hallucinated elements.
  • Reason: Provides a detailed explanation of the evaluation.
# evaluator is assumed to be an initialized Evaluator client (see the SDK setup guide linked above)
result = evaluator.evaluate(
    eval_templates="caption_hallucination",
    inputs={
        "input": "https://www.esparklearning.com/app/uploads/2024/04/Albert-Einstein-generated-by-AI-1024x683.webp",
        "output": "old man"
    },
    model_name="turing_flash"
)

print(result.eval_results[0].metrics[0].value)  # list containing 'Passed' or 'Failed'
print(result.eval_results[0].reason)            # explanation of the verdict

Example Output:

['Passed']
The evaluation is 'Passed' because the caption "old man" is accurately descriptive of what is clearly visible in the image without adding any hallucinated details.

- The image does indeed show an elderly male figure with characteristic features of advanced age (white/gray hair, wrinkles, aged appearance).
- The caption is minimalist but factually correct, avoiding any specific claims about identity, activity, setting, or other details that might constitute hallucination.
- While the caption doesn't capture the specific identity of the person (who appears to be Albert Einstein or an Einstein-like figure), simply describing the subject as an "old man" remains factually accurate without overreaching.

A different evaluation would only be warranted if the caption made claims about elements not visibly present in the image.
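When the evaluation runs inside a captioning pipeline, the Result value can be used to gate captions automatically. The snippet below is a minimal sketch that reuses the result object from the example above; the accept/reject handling is illustrative, not part of the SDK.

# Minimal sketch: gate a caption on the evaluation verdict (reuses `result` from the example above)
verdict = result.eval_results[0].metrics[0].value  # e.g. ['Passed'] or ['Failed']
reason = result.eval_results[0].reason

if "Passed" in verdict:
    print("Caption accepted")
else:
    # Log the reason and route the image for re-captioning (illustrative handling)
    print("Caption rejected:", reason)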

What to Do If You Get Undesired Results

If the caption is evaluated as containing hallucinations (Failed) and you want to improve it, follow these guidelines (a worked sketch appears after the list):

  • Stick strictly to describing what is visibly present in the image
  • Avoid making assumptions about:
    • People’s identities (unless clearly labeled or universally recognizable)
    • The location or setting (unless clearly identifiable)
    • Time periods or dates
    • Actions occurring before or after the captured moment
    • Emotions or thoughts of subjects
    • Objects that are partially obscured or ambiguous
  • Use qualifying language (like “appears to be,” “what looks like”) when uncertain
  • Focus on concrete visual elements rather than interpretations
  • For generic descriptions, stay high-level and avoid specifics that aren’t clearly visible
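
As a sketch of these guidelines in practice, the calls below evaluate two alternative captions for the same image used earlier: one that asserts an identity, setting, and date the image does not confirm, and one that sticks to visible elements and uses qualifying language. The captions and the expected verdicts in the comments are assumptions for illustration; the actual verdict depends on the evaluator model.

# Sketch: comparing an overreaching caption with a hedged one for the same image
image_url = "https://www.esparklearning.com/app/uploads/2024/04/Albert-Einstein-generated-by-AI-1024x683.webp"

# Caption that asserts identity, setting, and date not visible in the image
overreaching = evaluator.evaluate(
    eval_templates="caption_hallucination",
    inputs={"input": image_url, "output": "Albert Einstein lecturing at Princeton in 1946"},
    model_name="turing_flash"
)
print(overreaching.eval_results[0].metrics[0].value)  # likely ['Failed']: the setting and date are not visible in the image

# Caption that describes only visible elements and uses qualifying language
hedged = evaluator.evaluate(
    eval_templates="caption_hallucination",
    inputs={"input": image_url, "output": "an elderly man with white hair, who appears to be Albert Einstein"},
    model_name="turing_flash"
)
print(hedged.eval_results[0].metrics[0].value)  # likely ['Passed']: no claims beyond what is visible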

Comparing Caption Hallucination with Similar Evals

  • Is AI Generated Image: Caption Hallucination evaluates the accuracy of image descriptions, while Is AI Generated Image determines if the image itself was created by AI.
  • Detect Hallucination: Caption Hallucination specifically evaluates image descriptions, whereas Detect Hallucination evaluates factual fabrication in text content more broadly.
  • Factual Accuracy: Caption Hallucination focuses on whether descriptions match what’s visible in images, while Factual Accuracy evaluates correctness of factual statements more generally.