Prototype and Iterate on LLM Applications
Register a Prototype project with automatic span evaluation, iterate with versioned prompts, compare versions side by side, and choose a winner before deploying to production.
| Time | Difficulty | Package |
|---|---|---|
| 15 min | Intermediate | fi-instrumentation-otel |
- FutureAGI account → app.futureagi.com
- API keys: `FI_API_KEY` and `FI_SECRET_KEY` (see Get your API keys)
- Python 3.9+
- OpenAI API key
Install
```shell
pip install fi-instrumentation-otel traceAI-openai openai

export FI_API_KEY="your-api-key"
export FI_SECRET_KEY="your-secret-key"
export OPENAI_API_KEY="your-openai-api-key"
```
What is Prototype?
Prototype lets you test different LLM configurations, prompts, and parameters in a controlled environment before deploying to production. Each run is a version: you compare versions side by side on evaluation scores, cost, and latency, then choose a winner.
Tutorial
Register a prototype project (Version 1)
`register()` creates a tracer provider connected to FutureAGI. Setting `project_type=ProjectType.EXPERIMENT` creates a Prototype project. The `project_version_name` tags all traces from this run as a distinct version you can compare later.
`EvalTag` objects define which evaluations run automatically on every matching span, with no manual eval calls needed.
```python
from fi_instrumentation import register
from fi_instrumentation.fi_types import (
    ProjectType,
    EvalName,
    EvalTag,
    EvalTagType,
    EvalSpanKind,
    ModelChoices,
)

trace_provider = register(
    project_type=ProjectType.EXPERIMENT,
    project_name="support-bot-prototype",
    project_version_name="v1-baseline",
    eval_tags=[
        EvalTag(
            eval_name=EvalName.COMPLETENESS,
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.LLM,
            model=ModelChoices.TURING_FLASH,
            custom_eval_name="completeness_check",
            mapping={
                "input": "llm.input_messages.1.message.content",
                "output": "llm.output_messages.0.message.content",
            },
        ),
        EvalTag(
            eval_name=EvalName.SUMMARY_QUALITY,
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.LLM,
            model=ModelChoices.TURING_FLASH,
            custom_eval_name="response_quality",
            mapping={
                "input": "llm.input_messages.1.message.content",
                "output": "llm.output_messages.0.message.content",
            },
        ),
    ],
)
```

Each `EvalTag` has:

- `eval_name`: the built-in evaluation to run (e.g. `EvalName.COMPLETENESS`, `EvalName.SUMMARY_QUALITY`)
- `type`: where to apply the eval (`EvalTagType.OBSERVATION_SPAN`)
- `value`: which span kind to evaluate (`EvalSpanKind.LLM`)
- `mapping`: maps eval input keys to span attribute paths
- `model`: the FutureAGI eval model to use
- `custom_eval_name`: a label for this eval tag (must be unique per project)
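The `mapping` paths address messages in the traced span by list position: because each request sends `[system, user]`, index 1 in `llm.input_messages.1.message.content` selects the user question rather than the system prompt. The toy resolver below is my own illustration of that indexing, not the SDK's actual implementation:

```python
# Illustration only: mimic how a path like
# "llm.input_messages.1.message.content" addresses one message
# in the traced request by its position in the messages list.
messages = [
    {"role": "system", "content": "You are a helpful customer support agent. Answer concisely."},
    {"role": "user", "content": "How do I reset my password?"},
]

def content_at(path: str, msgs: list) -> str:
    """Pull the numeric index out of the attribute path and return that
    message's content. Toy helper, not the SDK's resolver."""
    index = int(path.split(".")[2])  # "llm.input_messages.<index>.message.content"
    return msgs[index]["content"]

print(content_at("llm.input_messages.1.message.content", messages))
# prints the user turn, not the system prompt
```

If your prompt had no system message, the user turn would sit at index 0 and the mapping would need `llm.input_messages.0.message.content` instead.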
Instrument and run your app
Patch the OpenAI client with `OpenAIInstrumentor` so every API call is automatically traced and evaluated against your `EvalTag` configuration.
```python
from traceai_openai import OpenAIInstrumentor
from openai import OpenAI

OpenAIInstrumentor().instrument(tracer_provider=trace_provider)
client = OpenAI()

questions = [
    "How do I reset my password?",
    "What is your refund policy?",
    "Can I upgrade my plan mid-cycle?",
]

for q in questions:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful customer support agent. Answer concisely."},
            {"role": "user", "content": q},
        ],
    )
    print(f"Q: {q}")
    print(f"A: {response.choices[0].message.content}\n")

trace_provider.force_flush()
```

Expected output:
```
Q: How do I reset my password?
A: Go to the login page, click "Forgot Password," enter your email, and follow the reset link sent to your inbox.

Q: What is your refund policy?
A: We offer full refunds within 30 days of purchase. After 30 days, refunds are prorated.

Q: Can I upgrade my plan mid-cycle?
A: Yes, you can upgrade anytime. The price difference is prorated for the remainder of your billing cycle.
```

View results in the Prototype dashboard
Go to app.futureagi.com, select Prototype (left sidebar under BUILD), and click your project `support-bot-prototype` to see version `v1-baseline`.
The dashboard shows:
- Every traced span with its input, output, token count, and latency
- Evaluation scores from your `EvalTag` configuration (`completeness_check` and `response_quality`) displayed alongside each span
Create Version 2 with a different prompt
This is where rapid iteration happens. Register a new version with a different `project_version_name` and run the same queries with an improved prompt. Each version is a separate experiment you can compare.
Warning
Each call to `register()` creates a new tracer provider. Run Version 2 in a separate script or after the Version 1 script completes — do not call `register()` twice in the same process.
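One way to honor this constraint is to keep a single script parameterized by version name and launch it once per version as a separate process. The sketch below is my own pattern, not part of the SDK: `PROMPTS` and `register_kwargs` are hypothetical names, and the actual `register()` / instrumentation calls are left as comments.

```python
import sys

# System prompts for each version; keys double as project_version_name values.
PROMPTS = {
    "v1-baseline": "You are a helpful customer support agent. Answer concisely.",
    "v2-detailed": (
        "You are a knowledgeable customer support agent. "
        "Provide detailed, step-by-step answers."
    ),
}

def register_kwargs(version_name: str) -> dict:
    """Build the version-specific keyword arguments for register()."""
    if version_name not in PROMPTS:
        raise ValueError(f"unknown version: {version_name}")
    return {
        "project_name": "support-bot-prototype",
        "project_version_name": version_name,
    }

if __name__ == "__main__" and len(sys.argv) > 1:
    version = sys.argv[1]  # e.g. python run_version.py v2-detailed
    kwargs = register_kwargs(version)
    # trace_provider = register(project_type=ProjectType.EXPERIMENT,
    #                           eval_tags=[...], **kwargs)
    # ...instrument, run the questions with PROMPTS[version], force_flush()
```

Running `python run_version.py v1-baseline` and then `python run_version.py v2-detailed` gives each version its own process, and therefore its own tracer provider.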
```python
from fi_instrumentation import register
from fi_instrumentation.fi_types import (
    ProjectType,
    EvalName,
    EvalTag,
    EvalTagType,
    EvalSpanKind,
    ModelChoices,
)
from traceai_openai import OpenAIInstrumentor
from openai import OpenAI

trace_provider_v2 = register(
    project_type=ProjectType.EXPERIMENT,
    project_name="support-bot-prototype",
    project_version_name="v2-detailed",
    eval_tags=[
        EvalTag(
            eval_name=EvalName.COMPLETENESS,
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.LLM,
            model=ModelChoices.TURING_FLASH,
            custom_eval_name="completeness_check",
            mapping={
                "input": "llm.input_messages.1.message.content",
                "output": "llm.output_messages.0.message.content",
            },
        ),
        EvalTag(
            eval_name=EvalName.SUMMARY_QUALITY,
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.LLM,
            model=ModelChoices.TURING_FLASH,
            custom_eval_name="response_quality",
            mapping={
                "input": "llm.input_messages.1.message.content",
                "output": "llm.output_messages.0.message.content",
            },
        ),
    ],
)

OpenAIInstrumentor().uninstrument()
OpenAIInstrumentor().instrument(tracer_provider=trace_provider_v2)
client = OpenAI()

questions = [
    "How do I reset my password?",
    "What is your refund policy?",
    "Can I upgrade my plan mid-cycle?",
]

for q in questions:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a knowledgeable customer support agent. "
                    "Provide detailed, step-by-step answers. "
                    "Include any relevant edge cases or exceptions. "
                    "End with a follow-up question to confirm the issue is resolved."
                ),
            },
            {"role": "user", "content": q},
        ],
    )
    print(f"Q: {q}")
    print(f"A: {response.choices[0].message.content}\n")

trace_provider_v2.force_flush()
```

Expected output:
```
Q: How do I reset my password?
A: Here's how to reset your password step by step:
1. Go to our login page at app.example.com
2. Click "Forgot Password" below the sign-in button
3. Enter the email address associated with your account
4. Check your inbox for a reset link (check spam if you don't see it within 5 minutes)
5. Click the link and enter your new password

Note: The reset link expires after 24 hours. If it expires, repeat the process.

Is there anything else about your account access I can help with?

Q: What is your refund policy?
...
```

Compare versions in the dashboard
Back in the Prototype dashboard, your project now shows two versions: `v1-baseline` and `v2-detailed`.
Click any version to see its individual traces and eval scores. The project overview shows aggregate metrics across all versions — average eval scores, latency, token usage, and cost — so you can compare at a glance.
Choose the winner
Once you have compared evaluation scores, latency, and cost across versions, choose a winner.
- Go to Prototype → click your project
- Click Choose Winner — a Winner Settings drawer opens
- Under Evaluation Metrics, adjust the importance slider (0 = Not Important, 10 = Very Important) for each eval — `completeness_check` and `response_quality`
- Under System Metrics, adjust the importance sliders for Avg Cost and Avg Latency
- Click Choose Winner to rank all versions
The version with the highest weighted score across your chosen importance values is selected as the winner.
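The exact ranking formula isn't documented here, but the idea behind a weighted score is a weighted average: normalize each metric so higher is better, multiply by the importance you assigned, and sum. A minimal sketch with made-up numbers (all scores and weights below are hypothetical, not values FutureAGI produces):

```python
def weighted_score(metrics: dict, weights: dict) -> float:
    """Weighted average of normalized metric scores (0-1, higher is better)."""
    total = sum(weights.values())
    return sum(metrics[name] * w for name, w in weights.items()) / total

# Hypothetical normalized scores; for avg_cost and avg_latency, a higher
# value means cheaper/faster after normalization.
v1 = {"completeness_check": 0.72, "response_quality": 0.68, "avg_cost": 0.90, "avg_latency": 0.85}
v2 = {"completeness_check": 0.91, "response_quality": 0.88, "avg_cost": 0.60, "avg_latency": 0.55}

# Importance sliders from the Winner Settings drawer (0-10), quality-heavy.
weights = {"completeness_check": 8, "response_quality": 7, "avg_cost": 3, "avg_latency": 2}

print(weighted_score(v1, weights))  # 0.746
print(weighted_score(v2, weights))  # 0.817 — v2 wins despite higher cost and latency
```

With quality weighted heavily, `v2-detailed` comes out ahead even though it is slower and more expensive; shift the sliders toward Avg Cost and Avg Latency and the ranking can flip.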
What you built
You can now register a Prototype project, auto-evaluate spans with EvalTags, iterate with versioned prompts, compare versions, and choose the best one for production.
- Registered a Prototype project with `ProjectType.EXPERIMENT` and automatic span evaluation via `EvalTag`
- Ran a baseline OpenAI app (v1) and saw completeness and response quality scores in the dashboard
- Iterated with a new prompt version (v2) using a different `project_version_name`
- Compared both versions on eval scores, latency, and cost in the Prototype dashboard
- Chose the winning version using weighted metric comparison