Skip to main content
These cookbooks walk you through the AI Evaluation SDK from zero to production. Each one solves a concrete problem — hallucination detection, RAG debugging, prompt injection defense, streaming safety, and more — with runnable code you can copy straight into your project.

Prerequisites

pip install ai-evaluation
Cookbooks that use an LLM judge (02, 08, 09) require a GOOGLE_API_KEY for Gemini. All other cookbooks run entirely locally with no API keys and no network calls.

Cookbooks

01 - Local Metrics

Catch hallucinations, wrong dosages, and contradictions in a medical chatbot — all locally in under one second.

02 - LLM-as-Judge

When heuristics miss paraphrases, use Gemini to judge accuracy with augment=True and custom prompts.

03 - RAG Evaluation

Diagnose where your RAG pipeline fails — is retrieval pulling the wrong docs, or is the LLM hallucinating?

04 - Guardrails

Build a sub-10ms security middleware that blocks jailbreaks, code injection, PII leaks, and secret exposure.

05 - Streaming Safety

Monitor streaming LLM output token-by-token and cut the stream the moment safety thresholds are breached.

06 - AutoEval

Describe your app in plain English and get an auto-configured test pipeline you can export to CI/CD.

07 - Feedback Loop

Store developer corrections in ChromaDB and teach your LLM judge to stop repeating the same mistakes.

08 - Multimodal Judge

Pass images and audio to the LLM judge — verify product descriptions match photos, check transcription accuracy.

Progression

The cookbooks are designed to build on each other:
StageCookbookWhat you add
Start here01 Local MetricsFast, free, local checks
Improve accuracy02 LLM-as-JudgeLLM refinement for hard cases
Debug your RAG03 RAG EvaluationSeparate retrieval vs. generation failures
Secure inputs04 GuardrailsBlock attacks before they reach the LLM
Protect in real-time05 StreamingCut off toxic output mid-stream
Automate setup06 AutoEvalGenerate pipelines from descriptions
Learn from mistakes07 Feedback LoopFew-shot feedback for the judge
Go multimodal08 Multimodal JudgeImages and audio evaluation