Eval Types
How the four evaluation methods in Future AGI (LLM as Judge, Deterministic, Statistical Metric, and LLM as Ranker) work, and how modality affects which ones apply.
About
Every eval template in Future AGI uses one of four evaluation methods to produce a result. The method determines how the eval computes its output, whether a judge model is required, and what kind of result to expect. Choosing the right type for your use case gives you the right balance of accuracy, speed, and cost.
LLM as Judge
The judge model reads the response, applies the template’s criteria, and reasons about whether it passes. This is the most flexible type: it handles subjective, context-dependent, and nuanced quality checks that cannot be expressed as a fixed rule.
Requires a judge model. Configure one in Future AGI models or custom models.
Returns: a result (pass/fail, score, or category) and a plain-language reason explaining the judgment.
Examples: Groundedness, Toxicity, Task Completion, Tone, Detect Hallucination, Instruction Adherence, PII Detection, Context Adherence, and all custom evals.
Best for:
- Quality checks that require understanding meaning or intent
- Safety and policy enforcement
- RAG pipeline evaluation (context adherence, relevance, chunk attribution)
- Custom business or regulatory rules written in plain language
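The pattern behind an LLM-as-Judge eval can be sketched as prompt construction plus structured parsing of the verdict. Everything below is illustrative, not Future AGI's actual template or API: `run_judge_eval`, the prompt wording, and the `PASS|FAIL: reason` output format are assumptions, and `judge` stands in for a call to a configured judge model.

```python
from typing import Callable

def run_judge_eval(criteria: str, response: str,
                   judge: Callable[[str], str]) -> dict:
    """Apply plain-language criteria to a response via a judge model.

    `judge` is any function that takes a prompt string and returns the
    model's text output in the (assumed) format 'PASS|FAIL: reason'.
    """
    prompt = (
        f"Criteria: {criteria}\n"
        f"Response to evaluate: {response}\n"
        "Answer with PASS or FAIL, then a colon and a one-sentence reason."
    )
    raw = judge(prompt)
    verdict, _, reason = raw.partition(":")
    return {"passed": verdict.strip().upper() == "PASS",
            "reason": reason.strip()}

# Stubbed judge for demonstration only; a real eval calls a judge model.
stub = lambda prompt: "FAIL: the response contains an unsupported claim."
result = run_judge_eval("No hallucinated facts.",
                        "The moon is made of cheese.", stub)
```

The key property this sketch shows is why LLM as Judge returns both a result and a reason: both are parsed out of the judge model's free-text answer.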
Deterministic / Rule-based
Computed directly from the text using code or string logic. No model is called and no API key is required. Given the same input, it always returns the same output.
Does not require a judge model. Runs locally; works without an API key via the standalone evaluate() function.
Returns: pass/fail only. No reason field.
Examples: Is JSON, Is Email, Contains Valid Link, No Invalid Links, One Line.
Best for:
- Format validation (valid JSON, email address, URL presence)
- High-volume pipelines where speed and zero API cost matter
- Offline or air-gapped environments
- First-pass filtering before running LLM-based evals
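Conceptually, a deterministic eval is just a pure function from text to pass/fail. A minimal sketch of the kind of logic behind checks like Is JSON, Is Email, and One Line (these helpers are illustrative, not the SDK's implementation; the email pattern in particular is simplified):

```python
import json
import re

def is_json(text: str) -> bool:
    """Pass if the text parses as valid JSON."""
    try:
        json.loads(text)
        return True
    except (ValueError, TypeError):
        return False

def is_email(text: str) -> bool:
    """Pass if the text looks like a single email address (simplified)."""
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", text.strip()) is not None

def one_line(text: str) -> bool:
    """Pass if the text contains no line breaks."""
    return "\n" not in text.strip()
```

Because these are pure functions with no network calls, they are the natural first-pass filter: run them over everything, and send only the survivors to the more expensive LLM-based evals.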
Statistical Metric
Computes a numeric score using an algorithm applied to the output and a reference value. Covers overlap metrics, edit distance, semantic similarity, and information retrieval metrics. Most need no model at all; embedding-based metrics are the exception, calling an embedding model rather than a generative judge.
Returns: a numeric score (e.g. 0–1 or 0–100). No reason field.
Examples:
| Metric | What it measures |
|---|---|
| BLEU, ROUGE | N-gram overlap between output and reference |
| Levenshtein Similarity | Character edit distance between output and reference |
| Numeric Similarity | Numerical difference between output and reference |
| Embedding Similarity | Semantic vector similarity between output and reference |
| Semantic List Contains | Whether output contains phrases semantically similar to a reference list |
| Recall@K, Precision@K, NDCG@K, MRR, Hit Rate | Retrieval quality for RAG pipelines |
| FID Score | Distribution similarity between sets of real and generated images |
| CLIP Score | Alignment between an image and its text description |
Best for:
- Benchmarking against a ground-truth reference answer
- RAG retrieval quality (recall, precision, ranking)
- Image generation quality
- Reproducible, model-free scoring
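Two of the table's metrics can be sketched in a few lines to show the model-free flavor of this type: Levenshtein Similarity as a normalized edit distance, and Recall@K for retrieval quality. These are textbook formulations for illustration, not the SDK's exact implementations.

```python
def levenshtein_similarity(a: str, b: str) -> float:
    """1 minus normalized edit distance; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return 1 - prev[-1] / max(len(a), len(b))

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant items that appear in the top-k results."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant) if relevant else 0.0
```

Both functions are deterministic given their inputs, which is exactly what makes statistical metrics suitable for reproducible benchmarking against ground truth.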
LLM as Ranker
A variant of LLM as Judge: instead of scoring a single response, the model ranks a set of retrieved context chunks by relevance to a query. Used specifically for evaluating retrieval ordering in RAG pipelines.
Requires a judge model.
Returns: a ranked score per context item.
Examples: Eval Ranking.
Best for:
- Evaluating whether a retrieval system surfaces the most relevant chunks at the top
- Diagnosing retrieval ordering issues in RAG pipelines
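The ranker pattern reduces to "score each chunk against the query, then sort." A hedged sketch, where `score_fn` stands in for the judge-model relevance call; the toy word-overlap scorer exists only so the example runs without a model and is not how a real ranker scores relevance:

```python
def rank_contexts(query: str, chunks: list[str],
                  score_fn) -> list[tuple[str, float]]:
    """Score each retrieved chunk for relevance to the query and
    return (chunk, score) pairs sorted best-first.

    `score_fn(query, chunk) -> float` stands in for a judge-model call
    returning a relevance score.
    """
    scored = [(chunk, score_fn(query, chunk)) for chunk in chunks]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy stand-in scorer: fraction of query words present in the chunk.
def overlap(query: str, chunk: str) -> float:
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0
```

Comparing the ranked order against the retriever's original order is what surfaces ordering problems: if the judge consistently ranks a late chunk first, the retrieval system is burying its most relevant material.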
Modality
In addition to the four types above, evals also vary by the kind of input they accept:
| Modality | What it evaluates | Example evals |
|---|---|---|
| Text | Any text input or output | Most built-in evals |
| Image | Images passed as inputs | CLIP Score, FID Score, Caption Hallucination, Image Instruction Adherence, Synthetic Image Evaluator, OCR Evaluation |
| Audio | Audio files or speech | Audio Quality, Audio Transcription, TTS Accuracy |
| Conversation | Multi-turn conversation histories | Customer Agent evals (Loop Detection, Context Retention, Query Handling, etc.) |
Multimodal evals (image, audio, conversation) require a judge model that supports the relevant modality. Use turing_large or turing_small for image and audio inputs.
Quick reference
| Type | Judge model required | Returns reason | Works without API key |
|---|---|---|---|
| LLM as Judge | Yes | Yes | No |
| Deterministic | No | No | Yes |
| Statistical Metric | No (most) | No | Yes (most) |
| LLM as Ranker | Yes | No | No |
Next steps
- Built-in evals: Full list with evaluation method and required inputs for each template.
- Create custom evals: Custom evals always use LLM as Judge.
- Judge models: Choose the right model for LLM as Judge and LLM as Ranker evals.
- Eval groups: Combine different eval types and run them together in one pass.