Judge Models: Choosing the Right Evaluation Scorer

Explains what a judge model is, how it scores AI responses, and how to choose the right judge model for your specific evaluation use case.

About

A judge model is the model that reads each response and applies the eval template criteria to produce a result. When you run an evaluation, the judge receives the text to evaluate, the template’s rule prompt, and the required inputs, then returns a result and a reason.

The judge model determines how accurately and how quickly each response gets scored. Choosing the right one lets you trade accuracy against latency and cost for your specific workload.

How a judge scores a response

1. The platform constructs a prompt from the eval template criteria and the row’s input values.
2. The judge model receives this prompt and reads the response being evaluated.
3. The judge returns a result (pass/fail, score, or category) and a reason explaining the judgment.
4. The platform stores the result and reason for that row.

The judge model does not generate or modify your AI’s responses. It only reads and scores them.
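The scoring flow above can be sketched in a few lines of plain Python. This is an illustration only, not the Future AGI SDK: `JudgeVerdict`, `build_prompt`, `score_row`, and `toy_judge` are hypothetical names, and the judge is a stand-in function rather than a real model call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class JudgeVerdict:
    result: str  # pass/fail, a score, or a category
    reason: str  # explanation of the judgment


def build_prompt(rule_prompt: str, inputs: Dict[str, str]) -> str:
    """Step 1: combine the template criteria with the row's input values."""
    filled = "\n".join(f"{name}: {value}" for name, value in inputs.items())
    return f"{rule_prompt}\n\n{filled}"


def score_row(
    rule_prompt: str,
    inputs: Dict[str, str],
    judge: Callable[[str], JudgeVerdict],
    store: List[dict],
) -> JudgeVerdict:
    prompt = build_prompt(rule_prompt, inputs)  # step 1
    verdict = judge(prompt)                     # steps 2 and 3
    # Step 4: keep the result and reason alongside the row.
    store.append({"inputs": inputs, "result": verdict.result, "reason": verdict.reason})
    return verdict


def toy_judge(prompt: str) -> JudgeVerdict:
    """Stand-in for a judge model (assumption: a real judge is a model call)."""
    if "response: Paris" in prompt:
        return JudgeVerdict("pass", "The response names the expected city.")
    return JudgeVerdict("fail", "The response does not name the expected city.")


rows: List[dict] = []
row_inputs = {"question": "What is the capital of France?", "response": "Paris"}
verdict = score_row("Pass if the response answers the question correctly.",
                    row_inputs, toy_judge, rows)
print(verdict.result, "-", verdict.reason)
```

Note that `score_row` only reads `inputs`; nothing in the flow writes back to the response being evaluated, which mirrors the read-and-score role described above.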

Available judge models

Future AGI provides a set of proprietary models built specifically for evaluation:

Model	Code	Best for	Latency
TURING_LARGE	`turing_large`	Max accuracy, multimodal evals (text, image, audio)	Higher
TURING_SMALL	`turing_small`	High fidelity at lower cost (text, image)	Medium
TURING_FLASH	`turing_flash`	Fast, high-accuracy evals (text, image)	Low
PROTECT	`protect`	Safety, guardrails, user-defined rules (text, audio)	Low
PROTECT_FLASH	`protect_flash`	First-pass binary filtering (text only)	Ultra-low

See Future AGI models for full details on each model.

You can also bring your own model using the custom models integration. This is useful when you need a domain-specific fine-tuned model, want to keep inference in a specific cloud region, or already pay for a model you want to use as the judge.

How to choose a judge

Situation	Recommended model
Maximum accuracy matters more than speed	`turing_large`
High quality at reasonable cost	`turing_small`
Large-scale runs where speed is important	`turing_flash`
Safety and guardrail checks	`protect` or `protect_flash`
Evaluating images or audio	`turing_large` (images and audio) or `turing_small` (images only)
Domain-specific or compliance requirements	Custom model
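The selection table can also be expressed as a small lookup. The sketch below is hypothetical: the model codes and supported modalities come from the tables on this page, but the `choose_judge` helper, the priority names, and the fallback order within each priority are assumptions made for illustration.

```python
from typing import Optional

# Modalities per model code, taken from the "Available judge models" table.
SUPPORTED_MODALITIES = {
    "turing_large": {"text", "image", "audio"},
    "turing_small": {"text", "image"},
    "turing_flash": {"text", "image"},
    "protect": {"text", "audio"},
    "protect_flash": {"text"},
}

# First entry per priority follows the recommendation table;
# the fallback order after it is an assumption.
PREFERENCE = {
    "accuracy": ["turing_large", "turing_small", "turing_flash"],
    "cost": ["turing_small", "turing_flash", "turing_large"],
    "speed": ["turing_flash", "turing_small", "turing_large"],
    "safety": ["protect", "protect_flash"],
}


def choose_judge(priority: str, modality: str = "text") -> Optional[str]:
    """Return the first preferred model that supports the modality, or None."""
    if priority not in PREFERENCE:
        raise ValueError(f"unknown priority: {priority!r}")
    for code in PREFERENCE[priority]:
        if modality in SUPPORTED_MODALITIES[code]:
            return code
    return None  # no built-in match; a custom model may be needed


print(choose_judge("accuracy"))         # turing_large
print(choose_judge("speed", "image"))   # turing_flash
print(choose_judge("speed", "audio"))   # turing_large
print(choose_judge("safety", "image"))  # None
```

Checking the modality before the priority avoids picking a fast model that cannot read the input at all, for example a text-only filter for an audio row.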

Next steps

Future AGI models: Full reference for built-in judge models.
Use custom models: Bring your own model as the judge.
Eval templates: The criteria the judge applies.
Eval results: What the judge produces after scoring.
