Multimodal Evaluation: Images, Audio, and PDF

Score image captions, detect AI-generated images, evaluate audio quality and TTS accuracy, and verify OCR output against source PDFs using built-in eval metrics.


Time: 10 min
Difficulty: Intermediate
Package: ai-evaluation
Prerequisites

Install

pip install ai-evaluation
export FI_API_KEY="your-api-key"
export FI_SECRET_KEY="your-secret-key"

Tutorial

Set up the Evaluator

import os
from fi.evals import Evaluator

evaluator = Evaluator(
    fi_api_key=os.environ["FI_API_KEY"],
    fi_secret_key=os.environ["FI_SECRET_KEY"],
)

Detect caption hallucination

Check whether a caption accurately describes an image. Pass the image as a URL (or base64) and the caption as text.

# Accurate caption
result = evaluator.evaluate(
    eval_templates="caption_hallucination",
    inputs={
        "image": "https://raw.githubusercontent.com/future-agi/cookbooks/main/ecom_agent/observe/generated_products/nike_air_max_sneakers.png",
        "caption": "A pair of white sneakers with a wavy sole design.",
    },
    model_name="turing_small",
)

eval_result = result.eval_results[0]
print(f"Passed: {eval_result.output}")
print(f"Reason: {eval_result.reason}")

Try a hallucinated caption against the same image:

result = evaluator.evaluate(
    eval_templates="caption_hallucination",
    inputs={
        "image": "https://raw.githubusercontent.com/future-agi/cookbooks/main/ecom_agent/observe/generated_products/nike_air_max_sneakers.png",
        "caption": "A red leather handbag with gold buckles on a wooden table.",
    },
    model_name="turing_small",
)

eval_result = result.eval_results[0]
print(f"Passed: {eval_result.output}")
print(f"Reason: {eval_result.reason}")
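The examples above pass the image as a URL. To evaluate a local image instead, you can base64-encode it first. A minimal sketch; whether the SDK expects a raw base64 string or a data URI is an assumption worth checking against the ai-evaluation docs:

```python
import base64

def image_to_base64(path: str) -> str:
    """Read a local image file and return its base64-encoded contents."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")

# inputs = {"image": image_to_base64("sneakers.png"), "caption": "..."}
```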

Detect AI-generated images

Score whether an image was generated by AI or is a real photograph.

result = evaluator.evaluate(
    eval_templates="synthetic_image_evaluator",
    inputs={
        "image": "https://raw.githubusercontent.com/future-agi/cookbooks/main/ecom_agent/observe/generated_products/nike_air_max_sneakers.png",
    },
    model_name="turing_small",
)

eval_result = result.eval_results[0]
print(f"Score:  {eval_result.output}")
print(f"Reason: {eval_result.reason}")

Evaluate audio quality

Get a Mean Opinion Score (MOS) assessment of audio quality. Pass the audio file as a URL or base64.

Warning

audio_quality and ASR/STT_accuracy require model_name="turing_large".

result = evaluator.evaluate(
    eval_templates="audio_quality",
    inputs={
        "input_audio": "https://storage.googleapis.com/cloud-samples-data/speech/brooklyn_bridge.flac",
    },
    model_name="turing_large",
)

eval_result = result.eval_results[0]
print(f"Score:  {eval_result.output}")
print(f"Reason: {eval_result.reason}")
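In a CI pipeline you might want to gate on a minimum MOS. A minimal sketch, assuming `eval_result.output` can be parsed as a float on the usual 1-5 MOS scale (the threshold value is illustrative):

```python
def passes_mos(output, minimum: float = 3.5) -> bool:
    """Return True if the MOS score parses as a float and meets the threshold."""
    try:
        return float(output) >= minimum
    except (TypeError, ValueError):
        return False

# if not passes_mos(eval_result.output):
#     raise SystemExit("Audio quality below MOS threshold")
```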

Evaluate text-to-speech accuracy

Check whether a TTS audio output accurately reflects the original text (including pronunciation, emphasis, and tone).

result = evaluator.evaluate(
    eval_templates="TTS_accuracy",
    inputs={
        "text": "Welcome to FutureAGI. Our platform helps you evaluate and optimize AI applications.",
        "generated_audio": "https://storage.googleapis.com/cloud-samples-data/speech/brooklyn_bridge.flac",
    },
    model_name="turing_large",
)

eval_result = result.eval_results[0]
print(f"Score:  {eval_result.output}")
print(f"Reason: {eval_result.reason}")

Evaluate OCR output against a PDF

Score how accurately OCR-extracted content matches the source PDF document.

result = evaluator.evaluate(
    eval_templates="ocr_evaluation",
    inputs={
        "input_pdf": "https://your-bucket.s3.amazonaws.com/sample-invoice.pdf",
        "json_content": '{"invoice_number": "INV-2024-001", "total": "$1,250.00", "date": "2024-03-15"}',
    },
    model_name="turing_large",
)

eval_result = result.eval_results[0]
print(f"Score:  {eval_result.output}")
print(f"Reason: {eval_result.reason}")
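Rather than hand-writing the JSON string, it is safer to build `json_content` from your OCR pipeline's output with `json.dumps`. A short sketch; the field names are just the ones from the example above:

```python
import json

# Structured fields produced by your OCR step
ocr_fields = {
    "invoice_number": "INV-2024-001",
    "total": "$1,250.00",
    "date": "2024-03-15",
}

# Serialize once, so quoting and escaping are always valid JSON
json_content = json.dumps(ocr_fields)
# inputs = {"input_pdf": "...", "json_content": json_content}
```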

Run multimodal evals from the dashboard

You can also run these evals directly from the FutureAGI platform without writing any code.

  1. Go to Datasets and create or open a dataset
  2. Add columns for your multimodal inputs (e.g. an image column with image URLs, or an audio column with audio URLs)
  3. Click Add Evaluation and select a multimodal eval (e.g. caption_hallucination, audio_quality)
  4. Map the eval’s required keys to your dataset columns (e.g. image → your image column, caption → your caption column)
  5. Choose a Turing model and click Run
  6. View scores alongside each row in the dataset

This is the same approach shown in the Dataset SDK cookbook and Dataset Management cookbook, but with multimodal columns instead of text-only ones.
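The same column-to-key mapping can be done in the SDK when your rows live in Python. A minimal sketch (the row dicts and column names here are hypothetical; `evaluator.evaluate` would be called per row exactly as in the earlier examples):

```python
def map_row_to_inputs(row: dict, mapping: dict) -> dict:
    """Map dataset columns to an eval's required input keys."""
    return {eval_key: row[column] for eval_key, column in mapping.items()}

# Hypothetical dataset rows and column names
rows = [
    {"image_url": "https://example.com/a.png", "caption_text": "A white sneaker."},
]
mapping = {"image": "image_url", "caption": "caption_text"}

for row in rows:
    inputs = map_row_to_inputs(row, mapping)
    # result = evaluator.evaluate(
    #     eval_templates="caption_hallucination",
    #     inputs=inputs,
    #     model_name="turing_small",
    # )
```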

Eval reference

| Eval | Inputs | Output | Turing models |
| --- | --- | --- | --- |
| caption_hallucination | image (URL/base64), caption (text) | Pass/Fail | all |
| synthetic_image_evaluator | image (URL/base64) | Score | all |
| audio_quality | input_audio (URL/base64) | Score | turing_large only |
| TTS_accuracy | text, generated_audio (URL/base64) | Score | all |
| ASR/STT_accuracy | audio (URL/base64), generated_transcript (text) | Score | turing_large only |
| ocr_evaluation | input_pdf (URL/base64), json_content (text) | Score | turing_large only |
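When wiring evals programmatically, you can guard against an unsupported Turing model up front instead of waiting for a server-side error. A minimal sketch encoding the model column of the reference above; treat the mapping as an assumption to keep in sync with the docs:

```python
# Templates that accept only turing_large, per the eval reference on this page
TURING_LARGE_ONLY = {"audio_quality", "ASR/STT_accuracy", "ocr_evaluation"}

def check_model(template: str, model_name: str) -> None:
    """Raise early if the template requires turing_large and another model is given."""
    if template in TURING_LARGE_ONLY and model_name != "turing_large":
        raise ValueError(f"{template} requires model_name='turing_large'")

check_model("caption_hallucination", "turing_small")  # ok
```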

Tip

Many text evals also accept image, audio, and PDF inputs. For example, detect_hallucination and context_adherence can take audio or images in their input keys. See the built-in eval metrics for the full list.

What you built

You can now evaluate images, audio, PDFs, and captions using built-in multimodal eval metrics and the FutureAGI dashboard.

  • Detected caption hallucinations by scoring text against a source image
  • Checked whether an image is AI-generated with synthetic_image_evaluator
  • Scored audio quality using MOS evaluation
  • Evaluated text-to-speech accuracy by comparing source text against generated audio
  • Verified OCR output against a source PDF document
