OpenTelemetry Integration
Built-in OpenTelemetry for the AI evaluation SDK. Auto-instrument LLM calls, track costs, enrich spans with scores, and export to any backend.
- setup_tracing() configures OTel with sensible defaults for LLM observability
- Auto-instrument OpenAI and Anthropic with instrument_all()
- Track token costs, enrich spans with scores, export to 13+ backends
The OTel module adds OpenTelemetry instrumentation directly into ai-evaluation. Trace LLM calls, calculate per-call costs, attach scores to spans, and export to any OTel-compatible backend.
Note
Requires pip install ai-evaluation. This is separate from the fi-instrumentation-otel + traceai-* packages in Tracing. Use this when you want observability tightly coupled with your scoring pipeline. Use fi-instrumentation-otel for standalone tracing across your whole stack.
Quick Example
from fi.evals.otel import setup_tracing, instrument_all, enable_auto_enrichment
# 1. Set up tracing
setup_tracing(service_name="my-app", otlp_endpoint="http://localhost:4317")
# 2. Auto-instrument all supported LLM libraries
instrumented = instrument_all()
print(f"Instrumented: {instrumented}") # ["openai", "anthropic"]
# 3. Enable auto-enrichment — scores automatically attach to spans
enable_auto_enrichment()
# Now all OpenAI/Anthropic calls are traced, costs calculated,
# and scores are attached to the active span
Setup
Basic
from fi.evals.otel import setup_tracing
setup_tracing(service_name="my-app") # exports to console by default
With OTLP endpoint
setup_tracing(
    service_name="my-app",
    otlp_endpoint="http://localhost:4317",
)
With TraceConfig
from fi.evals.otel import setup_tracing, TraceConfig
# Development — console output, all content captured
config = TraceConfig.development("my-app")
# Production — OTLP export, 10% sampling, cost alerts
config = TraceConfig.production(
    service_name="my-app",
    otlp_endpoint="https://otel-collector.internal:4317",
    service_version="2.1.0",
    eval_sample_rate=0.1,
)
# Multi-backend — export to multiple destinations
config = TraceConfig.multi_backend(
    service_name="my-app",
    backends=[
        {"type": "jaeger", "endpoint": "localhost:6831"},
        {"type": "datadog"},
    ],
)
setup_tracing(config=config)
Tracer utilities
from fi.evals.otel import get_tracer, get_current_span, is_tracing_enabled, shutdown_tracing
tracer = get_tracer("my-module")
span = get_current_span()
enabled = is_tracing_enabled()
shutdown_tracing() # flush and shutdown
Auto-Instrumentation
Instrument LLM libraries with one call. Currently supports OpenAI and Anthropic.
from fi.evals.otel import instrument_all, uninstrument_all, instrument, uninstrument
# Instrument everything available
libraries = instrument_all() # ["openai", "anthropic"]
# Or instrument individually
instrument("openai", capture_prompts=True, capture_completions=True, capture_streaming=True)
instrument("anthropic")
# Check status
from fi.evals.otel import is_instrumented, get_instrumented_libraries
print(is_instrumented("openai")) # True
print(get_instrumented_libraries()) # ["openai", "anthropic"]
# Remove instrumentation
uninstrument("openai")
uninstrument_all()
Tracing LLM calls manually
from fi.evals.otel import trace_llm_call
with trace_llm_call("chat", model="gpt-4o", system="openai") as span:
    response = client.chat.completions.create(...)
    span.set_attribute("gen_ai.usage.input_tokens", response.usage.prompt_tokens)
    span.set_attribute("gen_ai.usage.output_tokens", response.usage.completion_tokens)
Auto-Enrichment
When enabled, scoring calls automatically attach their results to the current active span.
from fi.evals.otel import enable_auto_enrichment, disable_auto_enrichment, is_auto_enrichment_enabled
from fi.evals import evaluate as run_eval
enable_auto_enrichment()
# This call automatically enriches the current span
result = run_eval("toxicity", output="Hello world", model="turing_flash")
# The span now has: eval.toxicity = 1.0
disable_auto_enrichment()
Manual enrichment
from fi.evals.otel import enrich_span_with_evaluation, enrich_span_with_eval_result, enrich_span_with_batch_result
# By metric name + score
enrich_span_with_evaluation("toxicity", score=0.95, reason="Safe content")
# From a result object
enrich_span_with_eval_result(result)
# From a batch result
count = enrich_span_with_batch_result(results) # returns number attached
Span context for scoring
Create a child span specifically for scoring:
from fi.evals.otel import EvaluationSpanContext
with EvaluationSpanContext("my_check") as ctx:
    score = run_my_custom_check(...)
    ctx.record_result(score=score, reason="explanation")
Cost Tracking
Automatically calculate token costs for every LLM call.
from fi.evals.otel import CostSpanProcessor, calculate_cost, DEFAULT_PRICING, TokenPricing
# Add cost tracking to your pipeline
processor = CostSpanProcessor(
    alert_threshold_usd=10.0,
    on_cost_alert=lambda total_cost, span_id: print(f"ALERT: cost ${total_cost:.2f} on span {span_id}"),
)
# Or calculate costs manually
costs = calculate_cost("gpt-4o", input_tokens=1000, output_tokens=500)
print(costs) # {"input_cost": 0.005, "output_cost": 0.0075, "total_cost": 0.0125}
# Get running totals
print(processor.total_cost_usd)
print(processor.get_summary())
# Add custom pricing
processor.add_custom_pricing("my-model", TokenPricing(
    model="my-model",
    input_per_1k=0.001,
    output_per_1k=0.002,
))
Built-in pricing
DEFAULT_PRICING includes 30+ models: OpenAI (gpt-4o, gpt-4o-mini, o1), Anthropic (claude-3.5-sonnet, claude-3-opus), Google (gemini-1.5-pro, gemini-2.0-flash), Mistral, Cohere, Meta, and embeddings.
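The arithmetic behind these rates is simple per-1k-token scaling. A minimal standalone sketch, mirroring the input_per_1k / output_per_1k fields of TokenPricing shown above (call_cost is an illustrative helper, not an SDK function):

```python
# Illustrative sketch of per-call cost arithmetic from per-1k-token rates.
# Not the SDK's implementation; it mirrors the TokenPricing fields above.
def call_cost(input_tokens: int, output_tokens: int,
              input_per_1k: float, output_per_1k: float) -> dict:
    input_cost = input_tokens / 1000 * input_per_1k
    output_cost = output_tokens / 1000 * output_per_1k
    return {
        "input_cost": input_cost,
        "output_cost": output_cost,
        "total_cost": input_cost + output_cost,
    }

# 1000 input + 500 output tokens at gpt-4o-style rates ($0.005 / $0.015 per 1k)
costs = call_cost(1000, 500, input_per_1k=0.005, output_per_1k=0.015)
```

This reproduces the calculate_cost output shown earlier: 0.005 input + 0.0075 output = 0.0125 total.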
Span Processors
Custom processors that run on every span.
LLMSpanProcessor
Extracts and normalizes LLM attributes from spans.
from fi.evals.otel import LLMSpanProcessor
processor = LLMSpanProcessor(
    capture_prompts=True,
    capture_completions=True,
    max_content_length=10000,
    redact_patterns=[r"\b\d{3}-\d{2}-\d{4}\b"],  # redact SSNs
)
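To see what a redact pattern does to captured content, here is a standalone sketch of regex-based scrubbing (the helper name and the "[REDACTED]" placeholder are illustrative; the processor's internals may differ):

```python
import re

# Scrub matching substrings before content is attached to a span (sketch).
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact(text: str, patterns=(SSN_PATTERN,), replacement="[REDACTED]") -> str:
    for pattern in patterns:
        text = pattern.sub(replacement, text)
    return text

clean = redact("Customer SSN is 123-45-6789, please verify.")
# clean == "Customer SSN is [REDACTED], please verify."
```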
EvaluationSpanProcessor
Runs scoring on LLM spans automatically.
from fi.evals.otel import EvaluationSpanProcessor
processor = EvaluationSpanProcessor(
    metrics=["relevance", "coherence"],
    sample_rate=0.1,  # score 10% of spans
    async_evaluation=True,  # don't block the span
    cache_enabled=True,
    evaluator_model="turing_flash",
)
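sample_rate behaves like a per-span coin flip: at 0.1, roughly one span in ten gets scored. A sketch of that decision (should_evaluate is illustrative, not an SDK function):

```python
import random

def should_evaluate(sample_rate: float) -> bool:
    # random.random() is uniform on [0, 1), so this is True for
    # roughly sample_rate of all calls.
    return random.random() < sample_rate
```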
BatchEvaluationProcessor
Batches spans for scoring efficiency.
from fi.evals.otel import BatchEvaluationProcessor
processor = BatchEvaluationProcessor(
    metrics=["toxicity"],
    batch_size=10,
    batch_timeout_ms=1000,
)
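The batching behavior can be pictured as a size-triggered buffer: spans accumulate until batch_size is reached, then flush as one scoring call (with batch_timeout_ms flushing partial batches). A minimal sketch with timeout handling omitted; not the processor's actual code:

```python
class SpanBatcher:
    """Collect items and flush them in groups of batch_size (sketch)."""

    def __init__(self, batch_size, flush):
        self.batch_size = batch_size
        self.flush = flush  # called with a full batch
        self._pending = []

    def add(self, item):
        self._pending.append(item)
        if len(self._pending) >= self.batch_size:
            self.flush(list(self._pending))
            self._pending.clear()

batches = []
batcher = SpanBatcher(batch_size=3, flush=batches.append)
for span_id in range(7):
    batcher.add(span_id)
# batches == [[0, 1, 2], [3, 4, 5]]; span 6 waits for the timeout flush
```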
FilteringSpanProcessor
Only process spans matching a filter.
from fi.evals.otel import FilteringSpanProcessor, CostSpanProcessor
cost_processor = CostSpanProcessor()
filtered = FilteringSpanProcessor(
    filter_fn=lambda span: "gpt-4" in str(span.attributes.get("gen_ai.request.model", "")),
    delegate=cost_processor,
)
CompositeSpanProcessor
Chain multiple processors together.
from fi.evals.otel import CompositeSpanProcessor, LLMSpanProcessor, CostSpanProcessor
composite = CompositeSpanProcessor(
    processors=[LLMSpanProcessor(), CostSpanProcessor()],
    parallel=True,
)
Exporter Backends
Export traces to any OTel-compatible backend.
| Backend | ExporterType | Default Endpoint |
|---|---|---|
| OTLP (gRPC) | OTLP_GRPC | localhost:4317 |
| OTLP (HTTP) | OTLP_HTTP | localhost:4318 |
| Jaeger | JAEGER | localhost:14268 |
| Zipkin | ZIPKIN | localhost:9411 |
| Console | CONSOLE | stdout |
| Datadog | DATADOG | Datadog agent |
| Honeycomb | HONEYCOMB | Honeycomb API |
| New Relic | NEWRELIC | New Relic API |
| Arize | ARIZE | Arize Phoenix |
| Langfuse | LANGFUSE | Langfuse API |
| Phoenix | PHOENIX | Arize Phoenix |
| Future AGI | FUTUREAGI | Future AGI API |
| Custom | CUSTOM | Your endpoint |
from fi.evals.otel import TraceConfig, ExporterConfig, ExporterType, get_exporter_preset
# Use a preset
config = TraceConfig(exporters=[get_exporter_preset("jaeger")])
# Or configure manually
config = TraceConfig(exporters=[
    ExporterConfig(type=ExporterType.OTLP_GRPC, endpoint="http://localhost:4317"),
    ExporterConfig(type=ExporterType.CONSOLE),  # also log to console
])
Configuration Reference
TraceConfig
| Field | Type | Default | Description |
|---|---|---|---|
| service_name | str | "llm-service" | Service name in traces |
| exporters | list | [CONSOLE] | Where to send traces |
| processors | list | [] | Span processors to run |
| sampling_strategy | SamplingStrategy | ALWAYS_ON | ALWAYS_ON, ALWAYS_OFF, RATIO, PARENT_BASED |
| evaluation | EvaluationConfig or None | None | Auto-scoring settings |
| cost | CostConfig or None | None | Cost tracking settings |
| content | ContentConfig or None | None | Content capture/redaction |
| resource | ResourceConfig or None | None | Service metadata |
| enabled | bool | True | Master switch |
| debug | bool | False | Debug logging |
ContentConfig
| Field | Type | Default | Description |
|---|---|---|---|
| capture_prompts | bool | True | Capture input messages |
| capture_completions | bool | True | Capture output messages |
| max_content_length | int | 10000 | Truncate content beyond this |
| redact_patterns | list | [] | Regex patterns to redact |
| redact_pii | bool | False | Auto-redact PII |
| pii_types | list | ["email", "phone", "ssn"] | PII types to redact |
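max_content_length caps what gets stored on a span. A sketch of the documented truncation behavior (the "...[truncated]" marker is an assumption; the SDK may simply cut the string):

```python
def truncate_content(content: str, max_len: int = 10000) -> str:
    # Content beyond max_content_length is cut off (sketch of the
    # documented behavior; the marker suffix is illustrative).
    if len(content) <= max_len:
        return content
    return content[:max_len] + "...[truncated]"

kept = truncate_content("hello", max_len=10)   # short content is unchanged
cut = truncate_content("x" * 50, max_len=10)   # long content is capped
```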
CostConfig
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | True | Enable cost tracking |
| pricing_source | str | "litellm" | Where to get pricing |
| custom_pricing | dict | {} | Custom model pricing |
| currency | str | "USD" | Currency for costs |
| alert_threshold_usd | float or None | None | Alert when cost exceeds |
| alert_callback | callable or None | None | Called on threshold |
EvaluationConfig
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | True | Enable auto-scoring on spans |
| metrics | list | ["relevance", "coherence"] | Metrics to run |
| sample_rate | float | 1.0 | Fraction of spans to score |
| async_evaluation | bool | True | Non-blocking scoring |
| timeout_ms | int | 5000 | Timeout per scoring call |
| cache_enabled | bool | True | Cache results |
| cache_ttl_seconds | int | 3600 | Cache TTL |
| evaluator_model | str or None | None | Model for cloud scoring |
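cache_enabled with cache_ttl_seconds amounts to a score cache whose entries expire after a fixed lifetime. A minimal sketch of that mechanism (not the SDK's cache; key choice and eviction are assumptions):

```python
import time

class TTLCache:
    """Cache values, expiring entries after ttl_seconds (sketch)."""

    def __init__(self, ttl_seconds: float = 3600):
        self.ttl = ttl_seconds
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._store[key]  # entry outlived its TTL
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic())

cache = TTLCache(ttl_seconds=3600)
cache.set(("toxicity", "Hello world"), 1.0)
```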
Exporting to Future AGI
from fi.evals.otel import setup_tracing, TraceConfig, ExporterConfig, ExporterType
config = TraceConfig(
    service_name="my-app",
    exporters=[ExporterConfig(type=ExporterType.FUTUREAGI)],
)
# Reads FI_API_KEY, FI_SECRET_KEY, FI_PROJECT_NAME from environment
setup_tracing(config=config)
Environment Variables
| Variable | Purpose | Used by |
|---|---|---|
| OTEL_SERVICE_NAME | Service name in traces | setup_tracing() |
| OTEL_EXPORTER_OTLP_ENDPOINT | OTLP exporter endpoint | OTLP exporters |
| OTEL_EXPORTER_OTLP_HEADERS | OTLP exporter headers | OTLP exporters |
| OTEL_DEPLOYMENT_ENVIRONMENT | Deployment environment label | ResourceConfig |
| FI_API_KEY | Future AGI API key | FutureAGI exporter |
| FI_SECRET_KEY | Future AGI secret key | FutureAGI exporter |
| FI_BASE_URL | Future AGI API endpoint | FutureAGI exporter |
| FI_PROJECT_NAME | Project name | FutureAGI exporter |
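A sketch of how such variables typically act as fallbacks when arguments are not passed explicitly (the precedence here, explicit argument over env var over default, is an assumption about the SDK, and the helper names are illustrative):

```python
import os

# Explicit argument wins; otherwise fall back to the env var, then a default.
def resolve_service_name(explicit=None):
    return explicit or os.environ.get("OTEL_SERVICE_NAME", "llm-service")

def resolve_otlp_endpoint(explicit=None):
    return explicit or os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT",
                                      "http://localhost:4317")
```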
Tip
If the OpenTelemetry packages are not installed, the module degrades gracefully: all functions become no-ops. Your code won't crash; tracing simply disables itself.
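The usual pattern behind this kind of graceful degradation is a guarded import. A sketch of the general technique, not the module's actual code:

```python
# Guarded import: fall back to no-op behavior when OTel is missing.
try:
    from opentelemetry import trace  # real implementation if installed
    _OTEL_AVAILABLE = True
except ImportError:
    trace = None
    _OTEL_AVAILABLE = False

def is_tracing_enabled() -> bool:
    # No-op path: report disabled instead of raising when OTel is absent.
    return _OTEL_AVAILABLE
```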
Semantic Conventions
Standard attribute names for LLM traces.
from fi.evals.otel import GenAIAttributes, CostAttributes, EvaluationAttributes, RAGAttributes
# LLM attributes
GenAIAttributes.PROVIDER_NAME # "gen_ai.provider.name" (preferred)
GenAIAttributes.SYSTEM # "gen_ai.system" (deprecated, use PROVIDER_NAME)
GenAIAttributes.REQUEST_MODEL # "gen_ai.request.model"
GenAIAttributes.USAGE_INPUT_TOKENS # "gen_ai.usage.input_tokens"
GenAIAttributes.USAGE_OUTPUT_TOKENS # "gen_ai.usage.output_tokens"
# Cost attributes
CostAttributes.TOTAL # "gen_ai.cost.total"
CostAttributes.INPUT # "gen_ai.cost.input"
# Scoring attributes
EvaluationAttributes.NAME # "gen_ai.evaluation.name"
EvaluationAttributes.SCORE_VALUE # "gen_ai.evaluation.score.value"
EvaluationAttributes.EXPLANATION # "gen_ai.evaluation.explanation"
# Legacy (still works): EvaluationAttributes.score("toxicity") → "eval.toxicity"
# RAG attributes (indexed)
RAGAttributes.NUM_DOCUMENTS # "rag.num_documents"
RAGAttributes.document_content(0) # "rag.document.0.content"
RAGAttributes.document_score(0) # "rag.document.0.score"
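The indexed helpers just interpolate the document index into the attribute name, equivalent to the following sketch (standalone functions matching the outputs shown above; not the SDK's internals):

```python
# Build indexed RAG attribute names (matches the pattern shown above).
def document_content(index: int) -> str:
    return f"rag.document.{index}.content"

def document_score(index: int) -> str:
    return f"rag.document.{index}.score"
```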
Helper functions
from fi.evals.otel import normalize_system_name, create_llm_span_attributes, create_evaluation_attributes
system = normalize_system_name("OpenAI") # "openai"
attrs = create_llm_span_attributes(
    system="openai", model="gpt-4o",
    input_tokens=100, output_tokens=50,
)
eval_attrs = create_evaluation_attributes(
    metric="toxicity", score=0.95, reason="Safe",
)