The Problem
Your AI generates product descriptions for an e-commerce catalog. Each product has a photo and a text description. You need to verify that each description matches what is actually in the image, without hallucinated features, colors, or details.
You also have a customer service bot that transcribes voicemails. You need to verify transcription accuracy against the original audio.
What You Will Learn
- How to pass image URLs to the LLM judge alongside text
- How to auto-generate grading criteria from a short description using generate_prompt=True
- How to compare multiple images (input vs output)
- How to evaluate audio transcriptions
- How the SDK remains backwards-compatible with text-only evaluation
Prerequisites
```bash
pip install ai-evaluation
export GOOGLE_API_KEY=your-gemini-api-key
```
This cookbook uses Gemini’s native vision and audio capabilities via LiteLLM. The images and audio files used are publicly accessible Google Cloud samples — no additional auth is needed.
Section 1: Image-Text Alignment
Pass an image_url alongside the text output to have the LLM judge evaluate whether the description matches the image.
Accurate Description
```python
from fi.evals import evaluate

MODEL = "gemini/gemini-2.5-flash"
FLOWER_IMAGE = "https://storage.googleapis.com/cloud-samples-data/ai-platform/flowers/daisy/100080576_f52e8ee070_n.jpg"

result = evaluate(
    prompt="""Rate how accurately the text description matches the image.
Score 1.0 if every detail in the description is visible in the image.
Score 0.5 if the description is partially correct but has some inaccuracies.
Score 0.0 if the description is completely wrong or describes something else.""",
    output="A white daisy flower with a yellow center, growing in a garden.",
    image_url=FLOWER_IMAGE,
    engine="llm",
    model=MODEL,
)

print(f"Score: {result.score}")
print(f"Reason: {result.reason[:150]}")
```
The judge can see the image and verify that the description accurately reflects a daisy flower.
Hallucinated Description
```python
result = evaluate(
    prompt="""Rate how accurately the text description matches the image.
Score 1.0 if every detail in the description is visible in the image.
Score 0.5 if the description is partially correct but has some inaccuracies.
Score 0.0 if the description is completely wrong or describes something else.""",
    output="A golden retriever puppy playing fetch on a sandy beach.",
    image_url=FLOWER_IMAGE,
    engine="llm",
    model=MODEL,
)

print(f"Score: {result.score}")
print(f"Reason: {result.reason[:150]}")
```
Expected: Score near 0.0 — the image shows a flower, not a dog on a beach. The judge correctly identifies the mismatch.
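In a catalog pipeline you would usually act on the judge's score rather than read each reason by hand. A minimal gating sketch; the 0.5 threshold and the idea of collecting scores from earlier evaluate() calls are illustrative assumptions, not SDK behavior:

```python
def flag_hallucination(score: float, threshold: float = 0.5) -> bool:
    """Return True when an image-text alignment score falls below the
    acceptance threshold, i.e. the description likely hallucinates."""
    return score < threshold

# Gate a batch of (sku, score) pairs collected from earlier evaluate() calls.
scores = {"sku-001": 0.95, "sku-002": 0.10, "sku-003": 0.55}
flagged = [sku for sku, s in scores.items() if flag_hallucination(s)]
print(flagged)  # only sku-002 falls below the 0.5 threshold
```

Tune the threshold against a handful of manually labeled examples before relying on it.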
Section 2: Auto-Generate Grading Criteria
Instead of writing a detailed rubric, describe what you want to evaluate in plain English and set generate_prompt=True. The SDK uses the LLM to generate a proper grading rubric automatically.
```python
TULIP_IMAGE = "https://storage.googleapis.com/cloud-samples-data/ai-platform/flowers/tulips/10791227_7168491604.jpg"

result = evaluate(
    prompt="flower species identification accuracy from photos",
    output="This image shows a bright red tulip in full bloom with green stems.",
    image_url=TULIP_IMAGE,
    engine="llm",
    model=MODEL,
    generate_prompt=True,
)

print(f"Score: {result.score}")
print(f"Reason: {result.reason[:150]}")
```
You can also inspect the generated criteria directly:
```python
from fi.evals.core.prompt_generator import generate_grading_criteria

criteria = generate_grading_criteria(
    "product photo quality assessment for e-commerce listings",
    MODEL,
    {"image_url": "...", "output": "..."},
)
print(f"Generated criteria:\n{criteria[:300]}...")
```
Use generate_prompt=True for rapid prototyping. Once you find criteria that work, copy the generated prompt into a custom prompt= parameter for consistency.
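One way to make that "prototype, then freeze" workflow concrete is to cache the generated rubric on first use and reuse the cached text on later runs. A sketch only; cached_criteria and the JSON cache layout are illustrative helpers, not part of the SDK:

```python
import json
from pathlib import Path

def cached_criteria(description: str, generate_fn, cache_path: str = "criteria_cache.json") -> str:
    """Return a grading rubric for `description`, generating it only on
    the first call (via generate_fn, e.g. a wrapper around
    generate_grading_criteria) and reusing the cached text afterwards."""
    path = Path(cache_path)
    cache = json.loads(path.read_text()) if path.exists() else {}
    if description not in cache:
        cache[description] = generate_fn(description)
        path.write_text(json.dumps(cache, indent=2))
    return cache[description]
```

Later runs can then pass the frozen text via prompt= instead of regenerating it, so every evaluation uses identical criteria.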
Section 3: Comparing Multiple Images
When you need to compare images, use input_image_url and output_image_url to provide a reference image and a candidate image.
```python
ROSE_IMAGE = "https://storage.googleapis.com/cloud-samples-data/ai-platform/flowers/roses/12240303_80d87f77a3_n.jpg"

result = evaluate(
    prompt="""You are given two images and a text description.
input_image_url is the reference product photo.
output_image_url is what the AI selected as matching.
Rate whether the text description matches the input_image_url (1.0) or
actually describes the output_image_url instead (0.0).""",
    output="A pink rose flower in a garden.",
    input_image_url=FLOWER_IMAGE,
    output_image_url=ROSE_IMAGE,
    engine="llm",
    model=MODEL,
)

print(f"Score: {result.score}")
print(f"Reason: {result.reason[:150]}")
```
The description says “pink rose” — the judge can see both images and determine which one the description actually matches.
Section 4: Audio Transcription Evaluation
Pass an audio_url to evaluate transcription accuracy against the original audio.
```python
AUDIO_URL = "https://storage.googleapis.com/cloud-samples-data/speech/brooklyn_bridge.flac"

result = evaluate(
    prompt="""Rate how accurately the transcription captures the audio content.
Score 1.0 if the transcription is accurate and complete.
Score 0.5 if partially correct. Score 0.0 if completely wrong.""",
    output="How old is the Brooklyn Bridge?",
    audio_url=AUDIO_URL,
    engine="llm",
    model=MODEL,
)

print(f"Score: {result.score}")
print(f"Reason: {result.reason[:150]}")
```
The LLM listens to the audio and compares it against the provided transcription text.
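When you also have a human reference transcript, a deterministic word error rate (WER) check is a useful complement to the LLM judge: it is cheap, reproducible, and catches regressions the judge might score leniently. A self-contained sketch using standard word-level edit distance; this helper is not part of the ai-evaluation SDK:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitutions + insertions + deletions)
    divided by the number of reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Classic dynamic-programming Levenshtein distance over word tokens.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("how old is the brooklyn bridge",
                      "how old is the brooklyn bridge"))  # 0.0
```

In practice, normalize punctuation and casing before comparing, and treat the WER and the judge's score as independent signals.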
Section 5: Text-Only Still Works
The multimodal parameters are additive. When you omit image and audio URLs, the judge works exactly as before with text only.
```python
result = evaluate(
    prompt="Rate the factual accuracy of the response given the context.",
    output="The Eiffel Tower is 330 meters tall and was built in 1889.",
    context="The Eiffel Tower, completed in 1889, stands at 330 metres.",
    engine="llm",
    model=MODEL,
)

print(f"Score: {result.score}")
print(f"Reason: {result.reason[:150]}")
```
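Because the multimodal parameters are additive, one convenient pattern is to assemble the evaluate() keyword arguments conditionally, so a single helper serves text-only and multimodal cases alike. A sketch; build_eval_kwargs is an illustrative helper, not part of the SDK:

```python
def build_eval_kwargs(prompt, output, *, context=None, image_url=None,
                      audio_url=None, model="gemini/gemini-2.5-flash"):
    """Assemble keyword arguments for evaluate(), including the optional
    modality parameters only when a value was actually provided."""
    kwargs = {"prompt": prompt, "output": output, "engine": "llm", "model": model}
    optional = {"context": context, "image_url": image_url, "audio_url": audio_url}
    kwargs.update({k: v for k, v in optional.items() if v is not None})
    return kwargs

# Text-only call: no image or audio keys appear in the kwargs.
text_only = build_eval_kwargs("Rate factual accuracy.", "The tower is 330 m tall.",
                              context="The Eiffel Tower stands at 330 metres.")
# Multimodal call: the image_url key is included.
with_image = build_eval_kwargs("Rate image-text alignment.", "A white daisy.",
                               image_url="https://example.com/daisy.jpg")
```

Either dictionary can then be passed straight through as evaluate(**kwargs).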
Supported Modality Parameters
| Parameter | Description | Example |
|---|---|---|
| image_url | Single image for the judge to evaluate | Product photo URL |
| input_image_url | Reference/input image for comparison | Original product image |
| output_image_url | Output/candidate image for comparison | AI-selected match |
| audio_url | Audio file for the judge to listen to | Voicemail recording |
All parameters accept publicly accessible URLs. The LLM processes them natively — no preprocessing or download step needed.
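Since the judge fetches each URL itself, a cheap pre-flight check that the URL is well-formed and uses http(s) can catch typos and non-fetchable schemes before you spend an LLM call. A sketch using only the standard library:

```python
from urllib.parse import urlparse

def is_usable_media_url(url: str) -> bool:
    """Return True if the URL parses with an http(s) scheme and a host,
    which is the minimum the judge needs in order to fetch it."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

print(is_usable_media_url(
    "https://storage.googleapis.com/cloud-samples-data/speech/brooklyn_bridge.flac"))  # True
print(is_usable_media_url("gs://my-bucket/photo.jpg"))  # False: gs:// is not fetchable over HTTP
```

This only validates the URL's shape; it does not verify that the resource is actually reachable or public.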
What to Try Next
You have completed all the cookbooks. Here are some directions to explore: