Judge Models

What a judge model is, how it scores responses, and how to choose the right one for your evaluation.

About

A judge model is the model that reads each response and applies the eval template criteria to produce a result. When you run an evaluation, the judge receives the text to evaluate, the template’s rule prompt, and the required inputs, then returns a result and a reason.

The judge model determines how accurately and how quickly each response is scored. Choosing the right one lets you trade accuracy against latency and cost for your specific workload.


How a judge scores a response

  1. The platform constructs a prompt from the eval template criteria and the row’s input values.
  2. The judge model receives this prompt and reads the response being evaluated.
  3. The judge returns a result (pass/fail, score, or category) and a reason explaining the judgment.
  4. The platform stores the result and reason for that row.

The judge model does not generate or modify your AI’s responses. It only reads and scores them.
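The four steps above can be sketched in a few lines of Python. This is an illustrative outline only, not the platform's implementation: `Judgment`, `build_prompt`, `score_row`, and `stub_judge` are hypothetical names, and the stub stands in for a real judge model call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Judgment:
    result: str  # pass/fail, a score, or a category
    reason: str  # explanation of the judgment


def build_prompt(criteria: str, inputs: Dict[str, str]) -> str:
    """Step 1: combine the template criteria with the row's input values."""
    lines = [f"Criteria: {criteria}"]
    lines += [f"{name}: {value}" for name, value in inputs.items()]
    return "\n".join(lines)


def score_row(
    row: Dict[str, str],
    criteria: str,
    judge: Callable[[str], Judgment],
    store: List[Judgment],
) -> Judgment:
    prompt = build_prompt(criteria, row)  # step 1
    judgment = judge(prompt)              # steps 2 and 3: the judge reads and scores
    store.append(judgment)                # step 4: keep result and reason for the row
    return judgment                       # the row itself is never modified


def stub_judge(prompt: str) -> Judgment:
    # Hypothetical stand-in for a real judge model: passes any non-empty response.
    # Assumes "response" is the last input in the prompt.
    response = prompt.split("response: ", 1)[-1].strip()
    if response:
        return Judgment("pass", "Response is non-empty.")
    return Judgment("fail", "Response is empty.")


results: List[Judgment] = []
row = {"question": "What is the capital of France?", "response": "Paris"}
judgment = score_row(row, "Response must not be empty.", stub_judge, results)
print(judgment.result, "-", judgment.reason)
```

Passing the judge in as a plain callable mirrors the point made above: the judge only reads the prompt and returns a result and a reason, so swapping one judge for another does not touch the responses being scored.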


Available judge models

Future AGI provides a set of proprietary models built specifically for evaluation:

| Model | Code | Best for | Latency |
| --- | --- | --- | --- |
| TURING_LARGE | turing_large | Max accuracy, multimodal evals (text, image, audio) | Higher |
| TURING_SMALL | turing_small | High fidelity at lower cost (text, image) | Medium |
| TURING_FLASH | turing_flash | Fast, high-accuracy evals (text, image) | Low |
| PROTECT | protect | Safety, guardrails, user-defined rules (text, audio) | Low |
| PROTECT_FLASH | protect_flash | First-pass binary filtering (text only) | Ultra-low |

See Future AGI models for full details on each model.
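As a rough illustration, the modality support listed for each model can be written as a lookup and filtered by what an evaluation needs. The dictionary below simply transcribes the modalities named on this page; `JUDGE_MODALITIES` and `judges_supporting` are hypothetical helper names, not part of any SDK.

```python
from typing import Dict, List, Set

# Modalities per model code, transcribed from the model list on this page.
JUDGE_MODALITIES: Dict[str, Set[str]] = {
    "turing_large": {"text", "image", "audio"},
    "turing_small": {"text", "image"},
    "turing_flash": {"text", "image"},
    "protect": {"text", "audio"},
    "protect_flash": {"text"},
}


def judges_supporting(*modalities: str) -> List[str]:
    """Return model codes whose listed modalities cover everything requested."""
    needed = set(modalities)
    return [
        code
        for code, supported in JUDGE_MODALITIES.items()
        if needed <= supported  # subset check: every needed modality is supported
    ]


print(judges_supporting("audio"))
print(judges_supporting("image", "audio"))
```

A check like this can catch a mismatch early, for example selecting a text-only model for an image evaluation, before any rows are scored.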

You can also bring your own model using the custom models integration. This is useful when you need a domain-specific fine-tuned model, want to keep inference in a specific cloud region, or already pay for a model you want to use as the judge.


How to choose a judge

| Situation | Recommended model |
| --- | --- |
| Maximum accuracy matters more than speed | turing_large |
| High quality at reasonable cost | turing_small |
| Large-scale runs where speed is important | turing_flash |
| Safety and guardrail checks | protect or protect_flash |
| Evaluating images or audio | turing_large (text, image, audio) or turing_small (text and image only) |
| Domain-specific or compliance requirements | Custom model |
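The guidance above can be encoded as a small helper so that a pipeline picks a judge consistently. This is a sketch under stated assumptions: `Priority` and `recommend_judge` are hypothetical names, the string "custom" is a placeholder for whatever identifier your custom model uses, and choosing protect_flash over protect when speed matters is an assumption based on its listed ultra-low latency.

```python
from enum import Enum


class Priority(Enum):
    ACCURACY = "accuracy"
    COST = "cost"
    SPEED = "speed"


def recommend_judge(
    priority: Priority,
    safety_check: bool = False,
    needs_custom_model: bool = False,
) -> str:
    """Map a situation to a judge model code, following the guidance above."""
    if needs_custom_model:
        # Domain-specific or compliance requirements: bring your own model.
        return "custom"  # placeholder, not a real model code
    if safety_check:
        # Assumption: prefer the faster first-pass filter only when speed matters.
        return "protect_flash" if priority is Priority.SPEED else "protect"
    return {
        Priority.ACCURACY: "turing_large",
        Priority.COST: "turing_small",
        Priority.SPEED: "turing_flash",
    }[priority]


print(recommend_judge(Priority.SPEED))
print(recommend_judge(Priority.ACCURACY, safety_check=True))
```

Keeping the mapping in one function makes the trade-off explicit and easy to revise if the recommendations change.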
