Eval Types
How the four evaluation methods in Future AGI (LLM as Judge, Deterministic, Statistical Metric, and LLM as Ranker) work, and how modality affects which ones apply.
About
Every eval template in Future AGI uses one of four evaluation methods to produce a result. The method determines how the eval computes its output, whether a judge model is required, and what kind of result to expect. Choosing the right type for your use case gives you the right balance of accuracy, speed, and cost.
LLM as Judge
The judge model reads the response, applies the template’s criteria, and reasons about whether it passes. This is the most flexible type: it handles subjective, context-dependent, and nuanced quality checks that cannot be expressed as a fixed rule.
Requires a judge model. Configure one in Future AGI models or custom models.
Returns: a result (pass/fail, score, or category) and a plain-language reason explaining the judgment.
Examples: Groundedness, Toxicity, Task Completion, Tone, Detect Hallucination, Instruction Adherence, PII Detection, Context Adherence, and all custom evals.
Best for:
- Quality checks that require understanding meaning or intent
- Safety and policy enforcement
- RAG pipeline evaluation (context adherence, relevance, chunk attribution)
- Custom business or regulatory rules written in plain language
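The pattern behind an LLM-as-Judge eval can be sketched as prompt construction plus structured parsing of the verdict. Everything below is illustrative, not Future AGI's actual template or API: `run_judge_eval`, the prompt wording, and the `PASS|FAIL: reason` output format are assumptions, and `judge` stands in for a call to a configured judge model.

```python
from typing import Callable

def run_judge_eval(criteria: str, response: str,
                   judge: Callable[[str], str]) -> dict:
    """Apply plain-language criteria to a response via a judge model.

    `judge` is any function that takes a prompt string and returns the
    model's text output in the (assumed) format 'PASS|FAIL: reason'.
    """
    prompt = (
        f"Criteria: {criteria}\n"
        f"Response to evaluate: {response}\n"
        "Answer with PASS or FAIL, then a colon and a one-sentence reason."
    )
    raw = judge(prompt)
    verdict, _, reason = raw.partition(":")
    return {"passed": verdict.strip().upper() == "PASS",
            "reason": reason.strip()}

# Stubbed judge for demonstration only; a real eval calls a judge model.
stub = lambda prompt: "FAIL: the response contains an unsupported claim."
result = run_judge_eval("No hallucinated facts.",
                        "The moon is made of cheese.", stub)
```

The key property this sketch shows is why LLM as Judge returns both a result and a reason: both are parsed out of the judge model's free-text answer.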
Deterministic / Rule-based
Computed directly from the text using code or string logic. No model is called and no API key is required. Given the same input, it always returns the same output.
Does not require a judge model. Runs locally; works without an API key via the standalone evaluate() function.
Returns: pass/fail only. No reason field.
Examples: Is JSON, Is Email, Contains Valid Link, No Invalid Links, One Line.
Best for:
- Format validation (valid JSON, email address, URL presence)
- High-volume pipelines where speed and zero API cost matter
- Offline or air-gapped environments
- First-pass filtering before running LLM-based evals
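Conceptually, a deterministic eval is just a pure function from text to pass/fail. A minimal sketch of the kind of logic behind checks like Is JSON, Is Email, and One Line (these helpers are illustrative, not the SDK's implementation; the email pattern in particular is simplified):

```python
import json
import re

def is_json(text: str) -> bool:
    """Pass if the text parses as valid JSON."""
    try:
        json.loads(text)
        return True
    except (ValueError, TypeError):
        return False

def is_email(text: str) -> bool:
    """Pass if the text looks like a single email address (simplified)."""
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", text.strip()) is not None

def one_line(text: str) -> bool:
    """Pass if the text contains no line breaks."""
    return "\n" not in text.strip()
```

Because these are pure functions with no network calls, they are the natural first-pass filter: run them over everything, and send only the survivors to the more expensive LLM-based evals.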
Statistical Metric
Computes a numeric score using an algorithm applied to the output and a reference value. Covers overlap metrics, edit distance, semantic similarity, and information retrieval metrics. Most need no model at all; embedding-based metrics are the exception, calling an embedding model rather than a generative judge.
Returns: a numeric score (e.g. 0–1 or 0–100). No reason field.
Examples:
| Metric | What it measures |
|---|---|
| BLEU, ROUGE | N-gram overlap between output and reference |
| Levenshtein Similarity | Character edit distance between output and reference |
| Numeric Similarity | Numerical difference between output and reference |
| Embedding Similarity | Semantic vector similarity between output and reference |
| Semantic List Contains | Whether output contains phrases semantically similar to a reference list |
| Recall@K, Precision@K, NDCG@K, MRR, Hit Rate | Retrieval quality for RAG pipelines |
| FID Score | Distribution similarity between sets of real and generated images |
| CLIP Score | Alignment between an image and its text description |
Best for:
- Benchmarking against a ground-truth reference answer
- RAG retrieval quality (recall, precision, ranking)
- Image generation quality
- Reproducible, model-free scoring
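Two of the table's metrics can be sketched in a few lines to show the model-free flavor of this type: Levenshtein Similarity as a normalized edit distance, and Recall@K for retrieval quality. These are textbook formulations for illustration, not the SDK's exact implementations.

```python
def levenshtein_similarity(a: str, b: str) -> float:
    """1 minus normalized edit distance; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return 1 - prev[-1] / max(len(a), len(b))

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant items that appear in the top-k results."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant) if relevant else 0.0
```

Both functions are deterministic given their inputs, which is exactly what makes statistical metrics suitable for reproducible benchmarking against ground truth.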
LLM as Ranker
A variant of LLM as Judge: instead of scoring a single response, the model ranks a set of retrieved context chunks by relevance to a query. Used specifically for evaluating retrieval ordering in RAG pipelines.
Requires a judge model.
Returns: a ranked score per context item.
Examples: Eval Ranking.
Best for:
- Evaluating whether a retrieval system surfaces the most relevant chunks at the top
- Diagnosing retrieval ordering issues in RAG pipelines
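The ranker pattern reduces to "score each chunk against the query, then sort." A hedged sketch, where `score_fn` stands in for the judge-model relevance call; the toy word-overlap scorer exists only so the example runs without a model and is not how a real ranker scores relevance:

```python
def rank_contexts(query: str, chunks: list[str],
                  score_fn) -> list[tuple[str, float]]:
    """Score each retrieved chunk for relevance to the query and
    return (chunk, score) pairs sorted best-first.

    `score_fn(query, chunk) -> float` stands in for a judge-model call
    returning a relevance score.
    """
    scored = [(chunk, score_fn(query, chunk)) for chunk in chunks]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy stand-in scorer: fraction of query words present in the chunk.
def overlap(query: str, chunk: str) -> float:
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0
```

Comparing the ranked order against the retriever's original order is what surfaces ordering problems: if the judge consistently ranks a late chunk first, the retrieval system is burying its most relevant material.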
Modality
In addition to the four types above, evals also vary by the kind of input they accept:
| Modality | What it evaluates | Example evals |
|---|---|---|
| Text | Any text input or output | Most built-in evals |
| Image | Images passed as inputs | CLIP Score, FID Score, Caption Hallucination, Image Instruction Adherence, Synthetic Image Evaluator, OCR Evaluation |
| Audio | Audio files or speech | Audio Quality, Audio Transcription, TTS Accuracy |
| Conversation | Multi-turn conversation histories | Customer Agent evals (Loop Detection, Context Retention, Query Handling, etc.) |
Multimodal evals (image, audio, conversation) require a judge model that supports the relevant modality. Use turing_large or turing_small for image and audio inputs.
Quick reference
| Type | Judge model required | Returns reason | Works without API key |
|---|---|---|---|
| LLM as Judge | Yes | Yes | No |
| Deterministic | No | No | Yes |
| Statistical Metric | No (most) | No | Yes (most) |
| LLM as Ranker | Yes | No | No |
Next steps
- Built-in evals: Full list with evaluation method and required inputs for each template.
- Create custom evals: Custom evals always use LLM as Judge.
- Judge models: Choose the right model for LLM as Judge and LLM as Ranker evals.
- Eval groups: Combine different eval types and run them together in one pass.