Understanding Prototype
What Prototype is, the problem it solves, and how versions, traces, and evals work together before you ship.
About
Prototype is a pre-production testing environment for LLM applications. It gives you a structured way to run multiple configurations of your application, such as different prompts, models, or parameters, and compare them on real outputs before deciding what goes to production.
Without Prototype, the only way to know if a change made things better is to ship it and see. That means real users encounter regressions, hallucinations, or tone problems before you do. Prototype moves that discovery earlier: you run versions, score outputs automatically with evaluations, and compare everything in one dashboard before any version reaches production.
The core workflow
- Register your project with a version name and the evaluations you want to run.
- Instrument your application so every LLM call is automatically traced.
- Run your application; each generation is captured, tagged to its version, and scored.
- Compare versions in the Prototype dashboard by evaluation scores, cost, and latency.
- Promote the best-performing version to production.
Every step is designed to be low-friction: instrumentation is automatic, scoring happens in the background, and the dashboard surfaces the comparison without manual analysis.
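The workflow above can be sketched in a few lines of Python. Everything here is hypothetical: `PrototypeClient`, `register`, and `trace` are illustrative stand-ins for the shape of the flow, not the real SDK's names or signatures.

```python
from dataclasses import dataclass, field

@dataclass
class Generation:
    """One captured LLM output, tagged to a version and scored."""
    version: str
    output: str
    scores: dict = field(default_factory=dict)

class PrototypeClient:
    """Hypothetical in-memory stand-in for a Prototype client."""

    def register(self, version: str, evals: list) -> None:
        # Step 1: register a version name and the evaluations to run.
        self.version = version
        self.evals = evals
        self.generations: list[Generation] = []

    def trace(self, output: str) -> Generation:
        # Steps 2-3: each generation is captured, tagged to its
        # version, and scored by every registered evaluation.
        gen = Generation(self.version, output)
        for name, scorer in self.evals:
            gen.scores[name] = scorer(output)
        self.generations.append(gen)
        return gen

client = PrototypeClient()
client.register("v2-concise-prompt", [("length_ok", lambda out: len(out) < 80)])
gen = client.trace("Paris is the capital of France.")
print(gen.version, gen.scores)  # v2-concise-prompt {'length_ok': True}
```

The point is the tagging: because every generation carries its version name and scores, the dashboard can group and compare runs without any manual bookkeeping on your side.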
What gets measured
Each version run is measured on three dimensions:
| Dimension | What it captures |
|---|---|
| Evaluation scores | Quality metrics like context adherence, toxicity, hallucination detection, and tone, scored automatically on every generation. |
| Cost | Token usage and estimated cost per generation for the model and configuration used. |
| Latency | Response time per generation, so you can see the performance tradeoff of different models or prompts. |
These three together give you a complete picture. A cheaper model may cost less but score worse on quality. A longer prompt may improve accuracy but add latency. Prototype shows all three at once.
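One way to see how the three dimensions trade off is to fold them into a single comparison score. The version names, numbers, and weights below are purely illustrative assumptions, not real measurements or Prototype's actual ranking logic:

```python
# Hypothetical per-version metrics, shaped like a dashboard aggregate.
runs = {
    "gpt-small": {"quality": 0.78, "cost_usd": 0.002, "latency_s": 0.4},
    "gpt-large": {"quality": 0.91, "cost_usd": 0.012, "latency_s": 1.1},
}

def score(m: dict) -> float:
    # Reward quality, penalize cost and latency; the weights are a
    # judgment call that depends on what your application values.
    return m["quality"] - 10 * m["cost_usd"] - 0.05 * m["latency_s"]

best = max(runs, key=lambda v: score(runs[v]))
print(best)  # gpt-small
```

With these weights the cheaper, faster model edges out the higher-quality one, which is exactly the kind of tradeoff that is invisible if you only look at evaluation scores.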
Key concepts
- Versions and Runs: What a version is and how runs get tagged and compared.
- EvalTags and Mapping: How evaluations attach to your runs and how span data maps to eval inputs.
Next steps
- Set Up Prototype: Register your project and instrument your app.
- Configure Evals for Prototype: Define which evaluations run on your outputs.
- Choose Winner: Rank versions and promote the best to production.