# Multimodal Evaluation: Images, Audio, and PDF
Score image captions, detect AI-generated images, evaluate audio quality and TTS accuracy, and verify OCR output against source PDFs using built-in multimodal eval metrics.
| Time | Difficulty | Package |
|---|---|---|
| 10 min | Intermediate | ai-evaluation |
- FutureAGI account → app.futureagi.com
- API keys: `FI_API_KEY` and `FI_SECRET_KEY` (see Get your API keys)
- Python 3.9+
## Install

```shell
pip install ai-evaluation
export FI_API_KEY="your-api-key"
export FI_SECRET_KEY="your-secret-key"
```
## Tutorial

### Set up the Evaluator
```python
import os
from fi.evals import Evaluator

evaluator = Evaluator(
    fi_api_key=os.environ["FI_API_KEY"],
    fi_secret_key=os.environ["FI_SECRET_KEY"],
)
```

### Detect caption hallucination
Check whether a caption accurately describes an image. Pass the image as a URL (or base64) and the caption as text.
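If your images live on disk rather than at a URL, you can pass a base64 string instead. A minimal encoding sketch (whether the API expects a raw base64 string or a prefixed data URI is an assumption to verify against the eval docs):

```python
import base64
from pathlib import Path

def image_to_base64(path: str) -> str:
    """Read a local image file and return its base64-encoded contents."""
    data = Path(path).read_bytes()
    return base64.b64encode(data).decode("ascii")

# Demo: write the 8-byte PNG magic header to a file and encode it
Path("sample.png").write_bytes(b"\x89PNG\r\n\x1a\n")
encoded = image_to_base64("sample.png")
print(encoded)
```

The resulting string goes in the `image` key in place of the URL.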
```python
# Accurate caption
result = evaluator.evaluate(
    eval_templates="caption_hallucination",
    inputs={
        "image": "https://raw.githubusercontent.com/future-agi/cookbooks/main/ecom_agent/observe/generated_products/nike_air_max_sneakers.png",
        "caption": "A pair of white sneakers with a wavy sole design.",
    },
    model_name="turing_small",
)
eval_result = result.eval_results[0]
print(f"Passed: {eval_result.output}")
print(f"Reason: {eval_result.reason}")
```

Try a hallucinated caption against the same image:
```python
result = evaluator.evaluate(
    eval_templates="caption_hallucination",
    inputs={
        "image": "https://raw.githubusercontent.com/future-agi/cookbooks/main/ecom_agent/observe/generated_products/nike_air_max_sneakers.png",
        "caption": "A red leather handbag with gold buckles on a wooden table.",
    },
    model_name="turing_small",
)
eval_result = result.eval_results[0]
print(f"Passed: {eval_result.output}")
print(f"Reason: {eval_result.reason}")
```

### Detect AI-generated images
Score whether an image was generated by AI or is a real photograph.
```python
result = evaluator.evaluate(
    eval_templates="synthetic_image_evaluator",
    inputs={
        "image": "https://raw.githubusercontent.com/future-agi/cookbooks/main/ecom_agent/observe/generated_products/nike_air_max_sneakers.png",
    },
    model_name="turing_small",
)
eval_result = result.eval_results[0]
print(f"Score: {eval_result.output}")
print(f"Reason: {eval_result.reason}")
```

### Evaluate audio quality
Get a Mean Opinion Score (MOS) assessment of audio quality. Pass the audio file as a URL or base64.
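For local audio files, a data-URI sketch that guesses the MIME type from the extension (assuming the eval accepts data-URI strings — the docs above only say URL or base64, so verify the exact format):

```python
import base64
import mimetypes
from pathlib import Path

def file_to_data_uri(path: str) -> str:
    """Encode a local file as a base64 data URI, guessing the MIME type."""
    mime = mimetypes.guess_type(path)[0] or "application/octet-stream"
    b64 = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{b64}"

# Demo: write the 4-byte FLAC magic header to a file and encode it
Path("clip.flac").write_bytes(b"fLaC")
uri = file_to_data_uri("clip.flac")
print(uri)
```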
> **Warning:** `audio_quality` and `ASR/STT_accuracy` require `model_name="turing_large"`.
```python
result = evaluator.evaluate(
    eval_templates="audio_quality",
    inputs={
        "input_audio": "https://storage.googleapis.com/cloud-samples-data/speech/brooklyn_bridge.flac",
    },
    model_name="turing_large",
)
eval_result = result.eval_results[0]
print(f"Score: {eval_result.output}")
print(f"Reason: {eval_result.reason}")
```

### Evaluate text-to-speech accuracy
Check whether a TTS audio output accurately reflects the original text (including pronunciation, emphasis, and tone).
```python
result = evaluator.evaluate(
    eval_templates="TTS_accuracy",
    inputs={
        "text": "Welcome to FutureAGI. Our platform helps you evaluate and optimize AI applications.",
        "generated_audio": "https://storage.googleapis.com/cloud-samples-data/speech/brooklyn_bridge.flac",
    },
    model_name="turing_large",
)
eval_result = result.eval_results[0]
print(f"Score: {eval_result.output}")
print(f"Reason: {eval_result.reason}")
```

### Evaluate OCR output against a PDF
Score how accurately OCR-extracted content matches the source PDF document.
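The `json_content` key takes a JSON string. If your OCR pipeline produces a Python dict, serialize it with `json.dumps` before passing it in (a sketch; the field names are illustrative, not a required schema):

```python
import json

# Hypothetical OCR output as a Python dict
ocr_fields = {
    "invoice_number": "INV-2024-001",
    "total": "$1,250.00",
    "date": "2024-03-15",
}

# Serialize to the JSON string the eval expects for json_content
json_content = json.dumps(ocr_fields)
print(json_content)
```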
```python
result = evaluator.evaluate(
    eval_templates="ocr_evaluation",
    inputs={
        "input_pdf": "https://your-bucket.s3.amazonaws.com/sample-invoice.pdf",
        "json_content": '{"invoice_number": "INV-2024-001", "total": "$1,250.00", "date": "2024-03-15"}',
    },
    model_name="turing_large",
)
eval_result = result.eval_results[0]
print(f"Score: {eval_result.output}")
print(f"Reason: {eval_result.reason}")
```

## Run multimodal evals from the dashboard
You can also run these evals directly from the FutureAGI platform without writing any code.
- Go to Datasets and create or open a dataset
- Add columns for your multimodal inputs (e.g. an `image` column with image URLs, or an `audio` column with audio URLs)
- Click Add Evaluation and select a multimodal eval (e.g. `caption_hallucination`, `audio_quality`)
- Map the eval's required keys to your dataset columns (e.g. `image` → your image column, `caption` → your caption column)
- Choose a Turing model and click Run
- View scores alongside each row in the dataset
This is the same approach shown in the Dataset SDK cookbook and Dataset Management cookbook, but with multimodal columns instead of text-only.
## Eval reference
| Eval | Inputs | Output | Turing models |
|---|---|---|---|
| `caption_hallucination` | `image` (URL/base64), `caption` (text) | Pass/Fail | all |
| `synthetic_image_evaluator` | `image` (URL/base64) | Score | all |
| `audio_quality` | `input_audio` (URL/base64) | Score | `turing_large` only |
| `TTS_accuracy` | `text`, `generated_audio` (URL/base64) | Score | all |
| `ASR/STT_accuracy` | `audio` (URL/base64), `generated_transcript` (text) | Score | `turing_large` only |
| `ocr_evaluation` | `input_pdf` (URL/base64), `json_content` (text) | Score | `turing_large` only |
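Before calling `evaluate`, you can sanity-check locally that your `inputs` dict carries the keys the table lists. This is a helper sketch mirroring the table above, not part of the SDK:

```python
# Required input keys per eval, transcribed from the reference table
REQUIRED_KEYS = {
    "caption_hallucination": {"image", "caption"},
    "synthetic_image_evaluator": {"image"},
    "audio_quality": {"input_audio"},
    "TTS_accuracy": {"text", "generated_audio"},
    "ASR/STT_accuracy": {"audio", "generated_transcript"},
    "ocr_evaluation": {"input_pdf", "json_content"},
}

def missing_keys(eval_name: str, inputs: dict) -> set:
    """Return the required input keys absent from `inputs` for the given eval."""
    return REQUIRED_KEYS[eval_name] - inputs.keys()

print(missing_keys("TTS_accuracy", {"text": "hello"}))   # generated_audio is missing
print(missing_keys("synthetic_image_evaluator", {"image": "https://…/img.png"}))
```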
> **Tip:** Many text evals also accept image, audio, and PDF inputs. For example, `detect_hallucination` and `context_adherence` can take audio or images in their input keys. See the built-in eval metrics for the full list.
## What you built
You can now evaluate images, audio, PDFs, and captions using built-in multimodal eval metrics and the FutureAGI dashboard.
- Detected caption hallucinations by scoring text against a source image
- Checked whether an image is AI-generated with `synthetic_image_evaluator`
- Scored audio quality using MOS evaluation
- Evaluated text-to-speech accuracy by comparing source text against generated audio
- Verified OCR output against a source PDF document