Prototype and Iterate on LLM Applications
Register a Prototype project with automatic span evaluation, iterate with versioned prompts, compare versions side by side, and choose a winner before deploying to production.
| Time | Difficulty | Package |
|---|---|---|
| 15 min | Intermediate | fi-instrumentation-otel |
- FutureAGI account → app.futureagi.com
- API keys: `FI_API_KEY` and `FI_SECRET_KEY` (see Get your API keys)
- Python 3.9+
- OpenAI API key
Install
```shell
pip install fi-instrumentation-otel traceAI-openai openai

export FI_API_KEY="your-api-key"
export FI_SECRET_KEY="your-secret-key"
export OPENAI_API_KEY="your-openai-api-key"
```
What is Prototype?
Prototype lets you test different LLM configurations, prompts, and parameters in a controlled environment before deploying to production. Each run is a version: you compare versions side by side on evaluation scores, cost, and latency, then choose a winner.
Tutorial
Register a prototype project (Version 1)
`register()` creates a tracer provider connected to FutureAGI. Setting `project_type=ProjectType.EXPERIMENT` creates a Prototype project. The `project_version_name` tags all traces from this run as a distinct version you can compare later.
`EvalTag` objects define which evaluations run automatically on every matching span, with no manual eval calls needed.
```python
from fi_instrumentation import register
from fi_instrumentation.fi_types import (
    ProjectType,
    EvalName,
    EvalTag,
    EvalTagType,
    EvalSpanKind,
    ModelChoices,
)

trace_provider = register(
    project_type=ProjectType.EXPERIMENT,
    project_name="support-bot-prototype",
    project_version_name="v1-baseline",
    eval_tags=[
        EvalTag(
            eval_name=EvalName.COMPLETENESS,
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.LLM,
            model=ModelChoices.TURING_FLASH,
            custom_eval_name="completeness_check",
            mapping={
                "input": "llm.input_messages.1.message.content",
                "output": "llm.output_messages.0.message.content",
            },
        ),
        EvalTag(
            eval_name=EvalName.SUMMARY_QUALITY,
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.LLM,
            model=ModelChoices.TURING_FLASH,
            custom_eval_name="response_quality",
            mapping={
                "input": "llm.input_messages.1.message.content",
                "output": "llm.output_messages.0.message.content",
            },
        ),
    ],
)
```

Each `EvalTag` has:

- `eval_name`: the built-in evaluation to run (e.g. `EvalName.COMPLETENESS`, `EvalName.SUMMARY_QUALITY`)
- `type`: where to apply the eval (`EvalTagType.OBSERVATION_SPAN`)
- `value`: which span kind to evaluate (`EvalSpanKind.LLM`)
- `mapping`: maps eval input keys to span attribute paths
- `model`: the FutureAGI eval model to use
- `custom_eval_name`: a label for this eval tag (must be unique per project)
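The `mapping` paths address messages in the traced span by list position: because each request sends `[system, user]`, index 1 in `llm.input_messages.1.message.content` selects the user question rather than the system prompt. The toy resolver below is my own illustration of that indexing, not the SDK's actual implementation:

```python
# Illustration only: mimic how a path like
# "llm.input_messages.1.message.content" addresses one message
# in the traced request by its position in the messages list.
messages = [
    {"role": "system", "content": "You are a helpful customer support agent. Answer concisely."},
    {"role": "user", "content": "How do I reset my password?"},
]

def content_at(path: str, msgs: list) -> str:
    """Pull the numeric index out of the attribute path and return that
    message's content. Toy helper, not the SDK's resolver."""
    index = int(path.split(".")[2])  # "llm.input_messages.<index>.message.content"
    return msgs[index]["content"]

print(content_at("llm.input_messages.1.message.content", messages))
# prints the user turn, not the system prompt
```

If your prompt had no system message, the user turn would sit at index 0 and the mapping would need `llm.input_messages.0.message.content` instead.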
Instrument and run your app
Patch the OpenAI client with `OpenAIInstrumentor` so every API call is automatically traced and evaluated against your `EvalTag` configuration.
```python
from traceai_openai import OpenAIInstrumentor
from openai import OpenAI

OpenAIInstrumentor().instrument(tracer_provider=trace_provider)
client = OpenAI()

questions = [
    "How do I reset my password?",
    "What is your refund policy?",
    "Can I upgrade my plan mid-cycle?",
]

for q in questions:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful customer support agent. Answer concisely."},
            {"role": "user", "content": q},
        ],
    )
    print(f"Q: {q}")
    print(f"A: {response.choices[0].message.content}\n")

trace_provider.force_flush()
```

Expected output:
```
Q: How do I reset my password?
A: Go to the login page, click "Forgot Password," enter your email, and follow the reset link sent to your inbox.

Q: What is your refund policy?
A: We offer full refunds within 30 days of purchase. After 30 days, refunds are prorated.

Q: Can I upgrade my plan mid-cycle?
A: Yes, you can upgrade anytime. The price difference is prorated for the remainder of your billing cycle.
```

View results in the Prototype dashboard
Go to app.futureagi.com, select Prototype (left sidebar under BUILD), and click your project `support-bot-prototype` to see version `v1-baseline`.
The dashboard shows:
- Every traced span with its input, output, token count, and latency
- Evaluation scores from your `EvalTag` configuration (`completeness_check` and `response_quality`) displayed alongside each span
Create Version 2 with a different prompt
This is where rapid iteration happens. Register a new version with a different `project_version_name` and run the same queries with an improved prompt. Each version is a separate experiment you can compare.
Warning
Each call to `register()` creates a new tracer provider. Run Version 2 in a separate script or after the Version 1 script completes — do not call `register()` twice in the same process.
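One way to honor this constraint is to keep a single script parameterized by version name and launch it once per version as a separate process. The sketch below is my own pattern, not part of the SDK: `PROMPTS` and `register_kwargs` are hypothetical names, and the actual `register()` / instrumentation calls are left as comments.

```python
import sys

# System prompts for each version; keys double as project_version_name values.
PROMPTS = {
    "v1-baseline": "You are a helpful customer support agent. Answer concisely.",
    "v2-detailed": (
        "You are a knowledgeable customer support agent. "
        "Provide detailed, step-by-step answers."
    ),
}

def register_kwargs(version_name: str) -> dict:
    """Build the version-specific keyword arguments for register()."""
    if version_name not in PROMPTS:
        raise ValueError(f"unknown version: {version_name}")
    return {
        "project_name": "support-bot-prototype",
        "project_version_name": version_name,
    }

if __name__ == "__main__" and len(sys.argv) > 1:
    version = sys.argv[1]  # e.g. python run_version.py v2-detailed
    kwargs = register_kwargs(version)
    # trace_provider = register(project_type=ProjectType.EXPERIMENT,
    #                           eval_tags=[...], **kwargs)
    # ...instrument, run the questions with PROMPTS[version], force_flush()
```

Running `python run_version.py v1-baseline` and then `python run_version.py v2-detailed` gives each version its own process, and therefore its own tracer provider.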
```python
from fi_instrumentation import register
from fi_instrumentation.fi_types import (
    ProjectType,
    EvalName,
    EvalTag,
    EvalTagType,
    EvalSpanKind,
    ModelChoices,
)
from traceai_openai import OpenAIInstrumentor
from openai import OpenAI

trace_provider_v2 = register(
    project_type=ProjectType.EXPERIMENT,
    project_name="support-bot-prototype",
    project_version_name="v2-detailed",
    eval_tags=[
        EvalTag(
            eval_name=EvalName.COMPLETENESS,
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.LLM,
            model=ModelChoices.TURING_FLASH,
            custom_eval_name="completeness_check",
            mapping={
                "input": "llm.input_messages.1.message.content",
                "output": "llm.output_messages.0.message.content",
            },
        ),
        EvalTag(
            eval_name=EvalName.SUMMARY_QUALITY,
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.LLM,
            model=ModelChoices.TURING_FLASH,
            custom_eval_name="response_quality",
            mapping={
                "input": "llm.input_messages.1.message.content",
                "output": "llm.output_messages.0.message.content",
            },
        ),
    ],
)

OpenAIInstrumentor().uninstrument()
OpenAIInstrumentor().instrument(tracer_provider=trace_provider_v2)
client = OpenAI()

questions = [
    "How do I reset my password?",
    "What is your refund policy?",
    "Can I upgrade my plan mid-cycle?",
]

for q in questions:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a knowledgeable customer support agent. "
                    "Provide detailed, step-by-step answers. "
                    "Include any relevant edge cases or exceptions. "
                    "End with a follow-up question to confirm the issue is resolved."
                ),
            },
            {"role": "user", "content": q},
        ],
    )
    print(f"Q: {q}")
    print(f"A: {response.choices[0].message.content}\n")

trace_provider_v2.force_flush()
```

Expected output:
```
Q: How do I reset my password?
A: Here's how to reset your password step by step:
1. Go to our login page at app.example.com
2. Click "Forgot Password" below the sign-in button
3. Enter the email address associated with your account
4. Check your inbox for a reset link (check spam if you don't see it within 5 minutes)
5. Click the link and enter your new password

Note: The reset link expires after 24 hours. If it expires, repeat the process.

Is there anything else about your account access I can help with?

Q: What is your refund policy?
...
```

Compare versions in the dashboard
Back in the Prototype dashboard, your project now shows two versions: `v1-baseline` and `v2-detailed`.
Click any version to see its individual traces and eval scores. The project overview shows aggregate metrics across all versions — average eval scores, latency, token usage, and cost — so you can compare at a glance.
Choose the winner
Once you have compared evaluation scores, latency, and cost across versions, choose a winner.
- Go to Prototype → click your project
- Click Choose Winner — a Winner Settings drawer opens
- Under Evaluation Metrics, adjust the importance slider (0 = Not Important, 10 = Very Important) for each eval — `completeness_check` and `response_quality`
- Under System Metrics, adjust the importance sliders for Avg Cost and Avg Latency
- Click Choose Winner to rank all versions
The version with the highest weighted score across your chosen importance values is selected as the winner.
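The exact ranking formula isn't documented here, but the idea behind a weighted score is a weighted average: normalize each metric so higher is better, multiply by the importance you assigned, and sum. A minimal sketch with made-up numbers (all scores and weights below are hypothetical, not values FutureAGI produces):

```python
def weighted_score(metrics: dict, weights: dict) -> float:
    """Weighted average of normalized metric scores (0-1, higher is better)."""
    total = sum(weights.values())
    return sum(metrics[name] * w for name, w in weights.items()) / total

# Hypothetical normalized scores; for avg_cost and avg_latency, a higher
# value means cheaper/faster after normalization.
v1 = {"completeness_check": 0.72, "response_quality": 0.68, "avg_cost": 0.90, "avg_latency": 0.85}
v2 = {"completeness_check": 0.91, "response_quality": 0.88, "avg_cost": 0.60, "avg_latency": 0.55}

# Importance sliders from the Winner Settings drawer (0-10), quality-heavy.
weights = {"completeness_check": 8, "response_quality": 7, "avg_cost": 3, "avg_latency": 2}

print(weighted_score(v1, weights))  # 0.746
print(weighted_score(v2, weights))  # 0.817 — v2 wins despite higher cost and latency
```

With quality weighted heavily, `v2-detailed` comes out ahead even though it is slower and more expensive; shift the sliders toward Avg Cost and Avg Latency and the ranking can flip.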
What you built
You can now register a Prototype project, auto-evaluate spans with EvalTags, iterate with versioned prompts, compare versions, and choose the best one for production.
- Registered a Prototype project with `ProjectType.EXPERIMENT` and automatic span evaluation via `EvalTag`
- Ran a baseline OpenAI app (v1) and saw completeness and response quality scores in the dashboard
- Iterated with a new prompt version (v2) using a different `project_version_name`
- Compared both versions on eval scores, latency, and cost in the Prototype dashboard
- Chose the winning version using weighted metric comparison