
The Problem

Your AI generates product descriptions for an e-commerce catalog. Each product has a photo and a text description. You need to verify that each description matches what is actually in the image and does not hallucinate features, colors, or details. You also have a customer service bot that transcribes voicemails, and you need to verify transcription accuracy against the original audio.

What You Will Learn

  • How to pass image URLs to the LLM judge alongside text
  • How to auto-generate grading criteria from a short description using generate_prompt=True
  • How to compare multiple images (input vs output)
  • How to evaluate audio transcriptions
  • How the SDK remains backwards-compatible with text-only evaluation

Prerequisites

pip install ai-evaluation
export GOOGLE_API_KEY=your-gemini-api-key
This cookbook uses Gemini’s native vision and audio capabilities via LiteLLM. The images and audio files used are publicly accessible Google Cloud samples — no additional auth is needed.

Section 1: Image-Text Alignment

Pass an image_url alongside the text output to have the LLM judge evaluate whether the description matches the image.

Accurate Description

from fi.evals import evaluate

MODEL = "gemini/gemini-2.5-flash"

FLOWER_IMAGE = "https://storage.googleapis.com/cloud-samples-data/ai-platform/flowers/daisy/100080576_f52e8ee070_n.jpg"

result = evaluate(
    prompt="""Rate how accurately the text description matches the image.
    Score 1.0 if every detail in the description is visible in the image.
    Score 0.5 if the description is partially correct but has some inaccuracies.
    Score 0.0 if the description is completely wrong or describes something else.""",
    output="A white daisy flower with a yellow center, growing in a garden.",
    image_url=FLOWER_IMAGE,
    engine="llm",
    model=MODEL,
)
print(f"Score: {result.score}")
print(f"Reason: {result.reason[:150]}")
The judge can see the image and verify that the description accurately reflects a daisy flower.

Hallucinated Description

result = evaluate(
    prompt="""Rate how accurately the text description matches the image.
    Score 1.0 if every detail in the description is visible in the image.
    Score 0.5 if the description is partially correct but has some inaccuracies.
    Score 0.0 if the description is completely wrong or describes something else.""",
    output="A golden retriever puppy playing fetch on a sandy beach.",
    image_url=FLOWER_IMAGE,
    engine="llm",
    model=MODEL,
)
print(f"Score: {result.score}")
print(f"Reason: {result.reason[:150]}")
Expected: Score near 0.0 — the image shows a flower, not a dog on a beach. The judge correctly identifies the mismatch.
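At catalog scale, the same check loops over every product. A minimal sketch: the `flag_mismatches` helper and the 0.5 threshold are illustrative choices, not SDK features.

```python
# Illustrative helper (not part of the ai-evaluation SDK): collect alignment
# scores per product, then flag the ones below a chosen threshold.

def flag_mismatches(scores, threshold=0.5):
    """Return product IDs whose image-text alignment score falls below threshold."""
    return [product_id for product_id, score in scores if score < threshold]

# In practice each score would come from evaluate(...) as shown above, e.g.:
# scores = [(item["id"], evaluate(prompt=RUBRIC, output=item["description"],
#                                 image_url=item["photo"], engine="llm",
#                                 model=MODEL).score) for item in catalog]
scores = [("sku-1", 1.0), ("sku-2", 0.0), ("sku-3", 0.5)]
print(flag_mismatches(scores))  # → ['sku-2']
```

Products that score exactly at the threshold are kept; tighten or loosen the cutoff to match your tolerance for false positives.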

Section 2: Auto-Generate Grading Criteria

Instead of writing a detailed rubric, describe what you want to evaluate in plain English and set generate_prompt=True. The SDK uses the LLM to generate a proper grading rubric automatically.

TULIP_IMAGE = "https://storage.googleapis.com/cloud-samples-data/ai-platform/flowers/tulips/10791227_7168491604.jpg"

result = evaluate(
    prompt="flower species identification accuracy from photos",
    output="This image shows a bright red tulip in full bloom with green stems.",
    image_url=TULIP_IMAGE,
    engine="llm",
    model=MODEL,
    generate_prompt=True,
)
print(f"Score: {result.score}")
print(f"Reason: {result.reason[:150]}")
You can also inspect the generated criteria directly:
from fi.evals.core.prompt_generator import generate_grading_criteria

criteria = generate_grading_criteria(
    "product photo quality assessment for e-commerce listings",
    MODEL,
    {"image_url": "...", "output": "..."},
)
print(f"Generated criteria:\n{criteria[:300]}...")
Use generate_prompt=True for rapid prototyping. Once you find criteria that work, copy the generated prompt into a custom prompt= parameter for consistency.

Section 3: Comparing Multiple Images

When you need to compare images, use input_image_url and output_image_url to provide a reference image and a candidate image.

ROSE_IMAGE = "https://storage.googleapis.com/cloud-samples-data/ai-platform/flowers/roses/12240303_80d87f77a3_n.jpg"

result = evaluate(
    prompt="""You are given two images and a text description.
    input_image_url is the reference product photo.
    output_image_url is what the AI selected as matching.
    Rate whether the text description matches the input_image_url (1.0) or
    actually describes the output_image_url instead (0.0).""",
    output="A pink rose flower in a garden.",
    input_image_url=FLOWER_IMAGE,
    output_image_url=ROSE_IMAGE,
    engine="llm",
    model=MODEL,
)
print(f"Score: {result.score}")
print(f"Reason: {result.reason[:150]}")
The description says “pink rose” — the judge can see both images and determine which one the description actually matches.

Section 4: Audio Transcription Evaluation

Pass an audio_url to evaluate transcription accuracy against the original audio.

AUDIO_URL = "https://storage.googleapis.com/cloud-samples-data/speech/brooklyn_bridge.flac"

result = evaluate(
    prompt="""Rate how accurately the transcription captures the audio content.
    Score 1.0 if the transcription is accurate and complete.
    Score 0.5 if partially correct. Score 0.0 if completely wrong.""",
    output="How old is the Brooklyn Bridge?",
    audio_url=AUDIO_URL,
    engine="llm",
    model=MODEL,
)
print(f"Score: {result.score}")
print(f"Reason: {result.reason[:150]}")
The LLM listens to the audio and compares it against the provided transcription text.
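When a ground-truth transcript is also available, a deterministic word error rate (WER) check can complement the LLM judge: the judge handles nuance, while WER cheaply catches gross mismatches. This helper is a standard Levenshtein-over-words sketch, not part of the SDK.

```python
# Illustrative helper (not part of the ai-evaluation SDK): word error rate,
# i.e. word-level edit distance normalized by reference length.

def word_error_rate(reference, hypothesis):
    """Levenshtein distance over words, divided by the reference word count."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[-1][-1] / max(len(ref), 1)

print(word_error_rate("how old is the brooklyn bridge",
                      "how old is the brooklyn bridge"))  # 0.0
```

A WER of 0.0 means a word-for-word match; each substituted, inserted, or deleted word raises it by 1 over the reference length.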

Section 5: Text-Only Still Works

The multimodal parameters are additive. When you omit image and audio URLs, the judge works exactly as before with text only.

result = evaluate(
    prompt="Rate the factual accuracy of the response given the context.",
    output="The Eiffel Tower is 330 meters tall and was built in 1889.",
    context="The Eiffel Tower, completed in 1889, stands at 330 metres.",
    engine="llm",
    model=MODEL,
)
print(f"Score: {result.score}")
print(f"Reason: {result.reason[:150]}")

Supported Modality Parameters

| Parameter | Description | Example |
| --- | --- | --- |
| `image_url` | Single image for the judge to evaluate | Product photo URL |
| `input_image_url` | Reference/input image for comparison | Original product image |
| `output_image_url` | Output/candidate image for comparison | AI-selected match |
| `audio_url` | Audio file for the judge to listen to | Voicemail recording |
All parameters accept publicly accessible URLs. The LLM processes them natively — no preprocessing or download step needed.
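When different items carry different assets, one way to keep call sites tidy is to assemble only the modality kwargs that are actually present. The helper and the asset-dict shape below are illustrative, not SDK features.

```python
# Illustrative helper (not part of the ai-evaluation SDK): keep only the
# recognized, non-empty modality parameters before forwarding to evaluate().
MODALITY_PARAMS = ("image_url", "input_image_url", "output_image_url", "audio_url")

def modality_kwargs(assets):
    """Filter a dict down to non-empty, supported modality parameters."""
    return {k: v for k, v in assets.items() if k in MODALITY_PARAMS and v}

assets = {"image_url": "https://example.com/photo.jpg", "audio_url": None}
print(modality_kwargs(assets))  # → {'image_url': 'https://example.com/photo.jpg'}
# result = evaluate(prompt=..., output=..., engine="llm", model=MODEL,
#                   **modality_kwargs(assets))
```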

What to Try Next

You have completed all the cookbooks. Here are some directions to explore: