Eval Types

The four evaluation methods in Future AGI (LLM as Judge, Deterministic, Statistical Metric, and LLM as Ranker), and how modality affects which evals apply.

About

Every eval template in Future AGI uses one of four evaluation methods to produce a result. The method determines how the eval computes its output, whether a judge model is required, and what kind of result to expect. Choosing the right type for your use case gives you the right balance of accuracy, speed, and cost.


LLM as Judge

The judge model reads the response, applies the template’s criteria, and reasons about whether it passes. This is the most flexible type: it handles subjective, context-dependent, and nuanced quality checks that cannot be expressed as a fixed rule.

Requires a judge model. Configure one in Future AGI models or custom models.

Returns: a result (pass/fail, score, or category) and a plain-language reason explaining the judgment.

Examples: Groundedness, Toxicity, Task Completion, Tone, Detect Hallucination, Instruction Adherence, PII Detection, Context Adherence, and all custom evals.

Best for:

  • Quality checks that require understanding meaning or intent
  • Safety and policy enforcement
  • RAG pipeline evaluation (context adherence, relevance, chunk attribution)
  • Custom business or regulatory rules written in plain language

Deterministic / Rule-based

Computed directly from the text using code or string logic. No model is called and no API key is required. Given the same input, always returns the same output.

Does not require a judge model. Runs locally; works without an API key via the standalone evaluate() function.

Returns: pass/fail only. No reason field.

Examples: Is JSON, Is Email, Contains Valid Link, No Invalid Links, One Line.

Best for:

  • Format validation (valid JSON, email address, URL presence)
  • High-volume pipelines where speed and zero API cost matter
  • Offline or air-gapped environments
  • First-pass filtering before running LLM-based evals
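Checks of this kind reduce to a few lines of ordinary code. A minimal sketch in plain Python follows; the function names (`is_json`, `is_email`, `one_line`) and the simple email pattern are illustrative, not the platform's implementation:

```python
import json
import re

def is_json(text: str) -> bool:
    """Pass if the text parses as valid JSON."""
    try:
        json.loads(text)
        return True
    except ValueError:
        return False

def is_email(text: str) -> bool:
    """Pass if the text looks like a single email address (simplified pattern)."""
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", text.strip()) is not None

def one_line(text: str) -> bool:
    """Pass if the trimmed text contains no line breaks."""
    return "\n" not in text.strip()

# Same input, same output, every time; no model call, no API key.
print(is_json('{"ok": true}'))
print(is_email("user@example.com"))
print(one_line("a single-line answer"))
```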

Statistical Metric

Computes a numeric score by applying an algorithm to the output and a reference value. Covers overlap metrics, edit distance, semantic similarity, and information-retrieval metrics. Most require no judge model; embedding-based metrics call an embedding model rather than a generative one.

Returns: a numeric score (e.g. 0–1 or 0–100). No reason field.

Examples:

  • BLEU, ROUGE: n-gram overlap between output and reference
  • Levenshtein Similarity: character edit distance between output and reference
  • Numeric Similarity: numerical difference between output and reference
  • Embedding Similarity: semantic vector similarity between output and reference
  • Semantic List Contains: whether the output contains phrases semantically similar to a reference list
  • Recall@K, Precision@K, NDCG@K, MRR, Hit Rate: retrieval quality for RAG pipelines
  • FID Score: distribution similarity between sets of real and generated images
  • CLIP Score: alignment between an image and its text description

Best for:

  • Benchmarking against a ground-truth reference answer
  • RAG retrieval quality (recall, precision, ranking)
  • Image generation quality
  • Reproducible, model-free scoring
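As a concrete illustration of how such metrics are pure functions of the output and reference, here is a plain-Python sketch of Levenshtein similarity and two retrieval metrics. These follow the standard textbook definitions and are not Future AGI's implementation:

```python
def levenshtein_similarity(a: str, b: str) -> float:
    """1 - (edit distance / max length); 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return 1 - prev[-1] / max(len(a), len(b))

def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of relevant items appearing in the top-k retrieved list."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def mrr(retrieved: list, relevant: set) -> float:
    """Reciprocal rank of the first relevant item (0 if none retrieved)."""
    for rank, item in enumerate(retrieved, 1):
        if item in relevant:
            return 1 / rank
    return 0.0
```

Because no model is involved, the same inputs always produce the same score, which is what makes these metrics reproducible benchmarks.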

LLM as Ranker

A variant of LLM as Judge in which the model, rather than scoring a single response, ranks a set of retrieved context chunks by relevance to a query. Used specifically for evaluating retrieval ordering in RAG pipelines.

Requires a judge model.

Returns: a ranked score per context item.

Examples: Eval Ranking.

Best for:

  • Evaluating whether a retrieval system surfaces the most relevant chunks at the top
  • Diagnosing retrieval ordering issues in RAG pipelines

Modality

In addition to the four types above, evals also vary by the kind of input they accept:

  • Text: any text input or output. Examples: most built-in evals.
  • Image: images passed as inputs. Examples: CLIP Score, FID Score, Caption Hallucination, Image Instruction Adherence, Synthetic Image Evaluator, OCR Evaluation.
  • Audio: audio files or speech. Examples: Audio Quality, Audio Transcription, TTS Accuracy.
  • Conversation: multi-turn conversation histories. Examples: Customer Agent evals (Loop Detection, Context Retention, Query Handling, etc.).

Multimodal evals (image, audio, conversation) require a judge model that supports the relevant modality. Use turing_large or turing_small for image and audio inputs.


Quick reference

  • LLM as Judge: judge model required; returns a reason; needs an API key
  • Deterministic: no judge model; no reason field; runs without an API key
  • Statistical Metric: no judge model (for most); no reason field; most run without an API key
  • LLM as Ranker: judge model required; no reason field; needs an API key

Next steps

  • Built-in evals: Full list with evaluation method and required inputs for each template.
  • Create custom evals: Custom evals always use LLM as Judge.
  • Judge models: Choose the right model for LLM as Judge and LLM as Ranker evals.
  • Eval groups: Combine different eval types and run them together in one pass.