Multimodal Evaluation: Images, Audio, and PDF

Score image captions, detect AI-generated images, evaluate audio quality and TTS accuracy, and verify OCR output against source PDFs using built-in eval metrics.


Time: 10 min
Difficulty: Intermediate
Package: ai-evaluation
Prerequisites

Install

pip install ai-evaluation
export FI_API_KEY="your-api-key"
export FI_SECRET_KEY="your-secret-key"

Tutorial

Set up the Evaluator

import os
from fi.evals import Evaluator

evaluator = Evaluator(
    fi_api_key=os.environ["FI_API_KEY"],
    fi_secret_key=os.environ["FI_SECRET_KEY"],
)

Detect caption hallucination

Check whether a caption accurately describes an image. Pass the image as a URL (or base64) and the caption as text.

# Accurate caption
result = evaluator.evaluate(
    eval_templates="caption_hallucination",
    inputs={
        "image": "https://raw.githubusercontent.com/future-agi/cookbooks/main/ecom_agent/observe/generated_products/nike_air_max_sneakers.png",
        "caption": "A pair of white sneakers with a wavy sole design.",
    },
    model_name="turing_small",
)

eval_result = result.eval_results[0]
print(f"Passed: {eval_result.output}")
print(f"Reason: {eval_result.reason}")

Try a hallucinated caption against the same image:

result = evaluator.evaluate(
    eval_templates="caption_hallucination",
    inputs={
        "image": "https://raw.githubusercontent.com/future-agi/cookbooks/main/ecom_agent/observe/generated_products/nike_air_max_sneakers.png",
        "caption": "A red leather handbag with gold buckles on a wooden table.",
    },
    model_name="turing_small",
)

eval_result = result.eval_results[0]
print(f"Passed: {eval_result.output}")
print(f"Reason: {eval_result.reason}")
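The examples above pass the image as a URL. To evaluate a local image instead, you can base64-encode it first. A minimal sketch; whether the SDK expects a raw base64 string or a data URI is an assumption worth checking against the ai-evaluation docs:

```python
import base64

def image_to_base64(path: str) -> str:
    """Read a local image file and return its base64-encoded contents."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")

# inputs = {"image": image_to_base64("sneakers.png"), "caption": "..."}
```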

Detect AI-generated images

Score whether an image was generated by AI or is a real photograph.

result = evaluator.evaluate(
    eval_templates="synthetic_image_evaluator",
    inputs={
        "image": "https://raw.githubusercontent.com/future-agi/cookbooks/main/ecom_agent/observe/generated_products/nike_air_max_sneakers.png",
    },
    model_name="turing_small",
)

eval_result = result.eval_results[0]
print(f"Score:  {eval_result.output}")
print(f"Reason: {eval_result.reason}")

Evaluate audio quality

Get a Mean Opinion Score (MOS) assessment of audio quality. Pass the audio file as a URL or base64.

Warning

audio_quality and ASR/STT_accuracy require model_name="turing_large".

result = evaluator.evaluate(
    eval_templates="audio_quality",
    inputs={
        "input_audio": "https://storage.googleapis.com/cloud-samples-data/speech/brooklyn_bridge.flac",
    },
    model_name="turing_large",
)

eval_result = result.eval_results[0]
print(f"Score:  {eval_result.output}")
print(f"Reason: {eval_result.reason}")
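In a CI pipeline you might want to gate on a minimum MOS. A minimal sketch, assuming `eval_result.output` can be parsed as a float on the usual 1-5 MOS scale (the threshold value is illustrative):

```python
def passes_mos(output, minimum: float = 3.5) -> bool:
    """Return True if the MOS score parses as a float and meets the threshold."""
    try:
        return float(output) >= minimum
    except (TypeError, ValueError):
        return False

# if not passes_mos(eval_result.output):
#     raise SystemExit("Audio quality below MOS threshold")
```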

Evaluate text-to-speech accuracy

Check whether a TTS audio output accurately reflects the original text (including pronunciation, emphasis, and tone).

result = evaluator.evaluate(
    eval_templates="TTS_accuracy",
    inputs={
        "text": "Welcome to FutureAGI. Our platform helps you evaluate and optimize AI applications.",
        "generated_audio": "https://storage.googleapis.com/cloud-samples-data/speech/brooklyn_bridge.flac",
    },
    model_name="turing_large",
)

eval_result = result.eval_results[0]
print(f"Score:  {eval_result.output}")
print(f"Reason: {eval_result.reason}")

Evaluate OCR output against a PDF

Score how accurately OCR-extracted content matches the source PDF document.

result = evaluator.evaluate(
    eval_templates="ocr_evaluation",
    inputs={
        "input_pdf": "https://your-bucket.s3.amazonaws.com/sample-invoice.pdf",
        "json_content": '{"invoice_number": "INV-2024-001", "total": "$1,250.00", "date": "2024-03-15"}',
    },
    model_name="turing_large",
)

eval_result = result.eval_results[0]
print(f"Score:  {eval_result.output}")
print(f"Reason: {eval_result.reason}")
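Rather than hand-writing the JSON string, it is safer to build `json_content` from your OCR pipeline's output with `json.dumps`. A short sketch; the field names are just the ones from the example above:

```python
import json

# Structured fields produced by your OCR step
ocr_fields = {
    "invoice_number": "INV-2024-001",
    "total": "$1,250.00",
    "date": "2024-03-15",
}

# Serialize once, so quoting and escaping are always valid JSON
json_content = json.dumps(ocr_fields)
# inputs = {"input_pdf": "...", "json_content": json_content}
```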

Run multimodal evals from the dashboard

You can also run these evals directly from the FutureAGI platform without writing any code.

  1. Go to Datasets and create or open a dataset
  2. Add columns for your multimodal inputs (e.g. an image column with image URLs, or an audio column with audio URLs)
  3. Click Add Evaluation and select a multimodal eval (e.g. caption_hallucination, audio_quality)
  4. Map the eval’s required keys to your dataset columns (e.g. image → your image column, caption → your caption column)
  5. Choose a Turing model and click Run
  6. View scores alongside each row in the dataset

This is the same approach shown in the Dataset SDK cookbook and Dataset Management cookbook, but with multimodal columns instead of text-only ones.
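The same column-to-key mapping can be done in the SDK when your rows live in Python. A minimal sketch (the row dicts and column names here are hypothetical; `evaluator.evaluate` would be called per row exactly as in the earlier examples):

```python
def map_row_to_inputs(row: dict, mapping: dict) -> dict:
    """Map dataset columns to an eval's required input keys."""
    return {eval_key: row[column] for eval_key, column in mapping.items()}

# Hypothetical dataset rows and column names
rows = [
    {"image_url": "https://example.com/a.png", "caption_text": "A white sneaker."},
]
mapping = {"image": "image_url", "caption": "caption_text"}

for row in rows:
    inputs = map_row_to_inputs(row, mapping)
    # result = evaluator.evaluate(
    #     eval_templates="caption_hallucination",
    #     inputs=inputs,
    #     model_name="turing_small",
    # )
```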

Eval reference

| Eval | Inputs | Output | Turing models |
| --- | --- | --- | --- |
| caption_hallucination | image (URL/base64), caption (text) | Pass/Fail | all |
| synthetic_image_evaluator | image (URL/base64) | Score | all |
| audio_quality | input_audio (URL/base64) | Score | turing_large only |
| TTS_accuracy | text, generated_audio (URL/base64) | Score | all |
| ASR/STT_accuracy | audio (URL/base64), generated_transcript (text) | Score | turing_large only |
| ocr_evaluation | input_pdf (URL/base64), json_content (text) | Score | turing_large only |
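When wiring evals programmatically, you can guard against an unsupported Turing model up front instead of waiting for a server-side error. A minimal sketch encoding the model column of the reference above; treat the mapping as an assumption to keep in sync with the docs:

```python
# Templates that accept only turing_large, per the eval reference on this page
TURING_LARGE_ONLY = {"audio_quality", "ASR/STT_accuracy", "ocr_evaluation"}

def check_model(template: str, model_name: str) -> None:
    """Raise early if the template requires turing_large and another model is given."""
    if template in TURING_LARGE_ONLY and model_name != "turing_large":
        raise ValueError(f"{template} requires model_name='turing_large'")

check_model("caption_hallucination", "turing_small")  # ok
```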

Tip

Many text evals also accept image, audio, and PDF inputs. For example, detect_hallucination and context_adherence can take audio or images in their input keys. See the built-in eval metrics for the full list.

What you built

You can now evaluate images, audio, PDFs, and captions using built-in multimodal eval metrics and the FutureAGI dashboard.

  • Detected caption hallucinations by scoring text against a source image
  • Checked whether an image is AI-generated with synthetic_image_evaluator
  • Scored audio quality using MOS evaluation
  • Evaluated text-to-speech accuracy by comparing source text against generated audio
  • Verified OCR output against a source PDF document
