Understanding Prototype
What Prototype is, the problem it solves, and how versions, traces, and evals work together before you ship.
About
Prototype is a pre-production testing environment for LLM applications. It gives you a structured way to run multiple configurations of your application, such as different prompts, models, or parameters, and compare them on real outputs before deciding what goes to production.
Without Prototype, the only way to know if a change made things better is to ship it and see. That means real users encounter regressions, hallucinations, or tone problems before you do. Prototype moves that discovery earlier: you run versions, score outputs automatically with evaluations, and compare everything in one dashboard before any version reaches production.
The core workflow
- Register your project with a version name and the evaluations you want to run.
- Instrument your application so every LLM call is automatically traced.
- Run your application; each generation is captured, tagged to its version, and scored.
- Compare versions in the Prototype dashboard by evaluation scores, cost, and latency.
- Promote the best-performing version to production.
Every step is designed to be low-friction: instrumentation is automatic, scoring happens in the background, and the dashboard surfaces the comparison without manual analysis.
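The workflow above can be sketched in a few lines of Python. Everything here is hypothetical: `PrototypeClient`, `register`, and `trace` are illustrative stand-ins for the shape of the flow, not the real SDK's names or signatures.

```python
from dataclasses import dataclass, field

@dataclass
class Generation:
    """One captured LLM output, tagged to a version and scored."""
    version: str
    output: str
    scores: dict = field(default_factory=dict)

class PrototypeClient:
    """Hypothetical in-memory stand-in for a Prototype client."""

    def register(self, version: str, evals: list) -> None:
        # Step 1: register a version name and the evaluations to run.
        self.version = version
        self.evals = evals
        self.generations: list[Generation] = []

    def trace(self, output: str) -> Generation:
        # Steps 2-3: each generation is captured, tagged to its
        # version, and scored by every registered evaluation.
        gen = Generation(self.version, output)
        for name, scorer in self.evals:
            gen.scores[name] = scorer(output)
        self.generations.append(gen)
        return gen

client = PrototypeClient()
client.register("v2-concise-prompt", [("length_ok", lambda out: len(out) < 80)])
gen = client.trace("Paris is the capital of France.")
print(gen.version, gen.scores)  # v2-concise-prompt {'length_ok': True}
```

The point is the tagging: because every generation carries its version name and scores, the dashboard can group and compare runs without any manual bookkeeping on your side.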
What gets measured
Each version run is measured on three dimensions:
| Dimension | What it captures |
|---|---|
| Evaluation scores | Quality metrics like context adherence, toxicity, hallucination detection, and tone, scored automatically on every generation. |
| Cost | Token usage and estimated cost per generation for the model and configuration used. |
| Latency | Response time per generation, so you can see the performance tradeoff of different models or prompts. |
These three together give you a complete picture. A cheaper model may cost less but score worse on quality. A longer prompt may improve accuracy but add latency. Prototype shows all three at once.
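One way to see how the three dimensions trade off is to fold them into a single comparison score. The version names, numbers, and weights below are purely illustrative assumptions, not real measurements or Prototype's actual ranking logic:

```python
# Hypothetical per-version metrics, shaped like a dashboard aggregate.
runs = {
    "gpt-small": {"quality": 0.78, "cost_usd": 0.002, "latency_s": 0.4},
    "gpt-large": {"quality": 0.91, "cost_usd": 0.012, "latency_s": 1.1},
}

def score(m: dict) -> float:
    # Reward quality, penalize cost and latency; the weights are a
    # judgment call that depends on what your application values.
    return m["quality"] - 10 * m["cost_usd"] - 0.05 * m["latency_s"]

best = max(runs, key=lambda v: score(runs[v]))
print(best)  # gpt-small
```

With these weights the cheaper, faster model edges out the higher-quality one, which is exactly the kind of tradeoff that is invisible if you only look at evaluation scores.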
Key concepts
- Versions and Runs: What a version is and how runs get tagged and compared.
- EvalTags and Mapping: How evaluations attach to your runs and how span data maps to eval inputs.
Next steps
- Set Up Prototype: Register your project and instrument your app.
- Configure Evals for Prototype: Define which evaluations run on your outputs.
- Choose Winner: Rank versions and promote the best to production.