The Problem
Your medical chatbot’s local faithfulness check gives a score of 0.4 to “Take ibuprofen twice daily” when the context says “Prescribe ibuprofen 2x per day.” The heuristic does not understand that “twice daily” and “2x per day” mean the same thing. You need a smarter judge. This cookbook shows three ways to use an LLM as your evaluation judge:
- augment=True: local heuristic first, then LLM refines the judgment (best of both worlds)
- Custom prompt: write your own domain-specific rubric
- Direct LLM: bypass heuristics entirely with engine="llm"
What You Will Learn
- How augment=True combines local speed with LLM intelligence
- How to write custom evaluation prompts for any domain
- How to run a batch QA review pipeline that flags responses for human review
- How to build a tone/empathy judge for customer support
Prerequisites
This cookbook requires a GOOGLE_API_KEY for Gemini. The SDK uses LiteLLM under the hood, so any LiteLLM-compatible model string works (e.g., openai/gpt-4o, anthropic/claude-sonnet-4-20250514).
Solution 1: augment=True
The simplest upgrade. Pass model= and augment=True to any built-in metric. The SDK runs the local heuristic first, then sends the heuristic result plus the inputs to the LLM for refinement.
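The two-stage flow described above can be sketched in plain Python. This is an illustration of the idea only, not the SDK's implementation: heuristic_faithfulness, augmented_judge, and the llm_call parameter are hypothetical names, the token-overlap heuristic is a stand-in for whatever the built-in metric computes, and a stub replaces the real model call so the sketch runs offline.

```python
import re
from typing import Callable


def tokens(text: str) -> set:
    """Lowercase word and number tokens found in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def heuristic_faithfulness(output: str, context: str) -> float:
    """Toy local heuristic: fraction of output tokens that also appear in the context."""
    out_tokens, ctx_tokens = tokens(output), tokens(context)
    if not out_tokens:
        return 0.0
    return len(out_tokens & ctx_tokens) / len(out_tokens)


def augmented_judge(output: str, context: str, llm_call: Callable[[str], str]) -> dict:
    """Run the local heuristic first, then ask an LLM to refine its score."""
    local_score = heuristic_faithfulness(output, context)
    prompt = (
        "A local heuristic scored the faithfulness of the output at "
        f"{local_score:.2f} (0 = unfaithful, 1 = faithful).\n"
        f"Context: {context}\n"
        f"Output: {output}\n"
        "Reply with a single refined score between 0 and 1."
    )
    refined = float(llm_call(prompt).strip())
    # Clamp in case the model replies with a value outside the expected range.
    return {"local_score": local_score, "score": max(0.0, min(1.0, refined))}


def stub_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; always replies 0.9."""
    return "0.9"


result = augmented_judge(
    "Take ibuprofen twice daily",
    "Prescribe ibuprofen 2x per day.",
    stub_llm,
)
print(result)
```

The toy heuristic scores this pair low because only "ibuprofen" overlaps, which is the kind of surface-level miss the LLM pass is meant to correct. In real use, llm_call would wrap an actual model request.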
Solution 2: Custom Domain-Specific Judge
For specialized domains, write a prompt that encodes your own evaluation criteria. Use {context}, {output}, and {input} as placeholders; the SDK fills them in automatically.
Medical Accuracy Judge
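A medical rubric built on the {context}, {output}, and {input} placeholders might look like the sketch below. The rubric wording is illustrative, and fill_prompt is a hypothetical helper that only mimics the substitution the SDK is described as doing automatically; it is shown so the final prompt text can be inspected.

```python
# Illustrative rubric; the criteria are assumptions, not taken from the SDK.
MEDICAL_PROMPT = """You are a clinical reviewer checking a chatbot answer.

Reference context:
{context}

Patient question:
{input}

Chatbot answer:
{output}

Criteria:
1. Drug names, doses, and frequencies must match the context.
   Equivalent phrasings count as a match ("twice daily" = "2x per day").
2. The answer must not add instructions that are absent from the context.

Reply with a score between 0.0 and 1.0 and a one-sentence reason.
"""


def fill_prompt(template: str, **fields: str) -> str:
    """Replace each {name} placeholder with its value (hypothetical helper)."""
    for name, value in fields.items():
        template = template.replace("{" + name + "}", value)
    return template


prompt = fill_prompt(
    MEDICAL_PROMPT,
    context="Prescribe ibuprofen 2x per day.",
    input="How often should I take ibuprofen?",
    output="Take ibuprofen twice daily",
)
print(prompt)
```

Plain string replacement is used instead of str.format so that a rubric containing literal braces, such as a JSON example, would not raise an error.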
Customer Support Tone Judge
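As a sketch of a tone rubric for customer support, the example below pairs a prompt with a small parser for the judge's reply. Everything here is an assumption for illustration: the rubric wording, the JSON reply format, the parse_verdict helper, the 0.7 threshold, and the canned reply that stands in for a real model response.

```python
import json

# Illustrative rubric; the wording and reply format are assumptions.
TONE_PROMPT = """You are reviewing a customer support reply for tone.

Customer message:
{input}

Agent reply:
{output}

Rate empathy and professionalism, each from 0.0 to 1.0.
Reply with JSON containing the keys "empathy", "professionalism", and "reason".
"""


def parse_verdict(reply: str, threshold: float = 0.7) -> dict:
    """Average the two ratings and compare the result with a pass threshold."""
    data = json.loads(reply)
    score = (float(data["empathy"]) + float(data["professionalism"])) / 2
    return {
        "score": score,
        "passed": score >= threshold,
        "reason": data.get("reason", ""),
    }


# Canned reply standing in for what a real model might return.
stub_reply = (
    '{"empathy": 0.5, "professionalism": 1.0, '
    '"reason": "Polite, but does not acknowledge the frustration."}'
)
verdict = parse_verdict(stub_reply)
print(verdict)
```

Asking the judge for separate ratings keeps the reason for a low score visible, so a reviewer can tell a curt reply from an unprofessional one.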
The same pattern works for any domain, including a judge that evaluates empathy and professionalism in customer support.
Use Case: Automated QA Review Pipeline
In production, you likely have a batch of chatbot responses to review before deployment. Use augment=True to score each one and flag failures for human review.
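The score-and-flag loop can be sketched as follows. The review_batch function, the 0.7 threshold, and the record layout are hypothetical, and stub_judge returns canned scores in place of a real augmented metric call, so the numbers are placeholders rather than measured results.

```python
from typing import Callable


def review_batch(responses: list, judge: Callable[[dict], float], threshold: float = 0.7) -> list:
    """Score every response and return those that fall below the threshold."""
    flagged = []
    for item in responses:
        score = judge(item)
        if score < threshold:
            # Keep the original record and attach the score for the human reviewer.
            flagged.append({**item, "score": score})
    return flagged


# Canned scores standing in for real judge output (placeholder values).
CANNED_SCORES = {"r1": 0.9, "r2": 0.3, "r3": 0.75}


def stub_judge(item: dict) -> float:
    """Hypothetical stand-in for a call to an LLM-augmented metric."""
    return CANNED_SCORES[item["id"]]


batch = [
    {"id": "r1", "output": "Take ibuprofen twice daily",
     "context": "Prescribe ibuprofen 2x per day."},
    {"id": "r2", "output": "Take ibuprofen four times daily",
     "context": "Prescribe ibuprofen 2x per day."},
    {"id": "r3", "output": "Take ibuprofen two times a day",
     "context": "Prescribe ibuprofen 2x per day."},
]

flagged = review_batch(batch, stub_judge)
for item in flagged:
    print(f"Needs human review: {item['id']} (score {item['score']})")
```

Passing the judge in as a callable keeps the loop independent of any particular metric, so the same pipeline can be reused with a custom prompt judge.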
What to Try Next
Now that you can judge individual responses, learn how to diagnose failures across an entire RAG pipeline, separating retrieval problems from generation problems.
Next: RAG Evaluation
Measure retrieval quality and generation quality independently to know exactly what to fix.