OpenTelemetry Integration

Built-in OpenTelemetry for the AI evaluation SDK. Auto-instrument LLM calls, track costs, enrich spans with scores, and export to any backend.

📝 TL;DR
  • setup_tracing() configures OTel with sensible defaults for LLM observability
  • Auto-instrument OpenAI and Anthropic with instrument_all()
  • Track token costs, enrich spans with scores, export to 13+ backends

The OTel module adds OpenTelemetry instrumentation directly into ai-evaluation. Trace LLM calls, calculate per-call costs, attach scores to spans, and export to any OTel-compatible backend.

Note

Requires pip install ai-evaluation. This is separate from the fi-instrumentation-otel + traceai-* packages in Tracing. Use this when you want observability tightly coupled with your scoring pipeline. Use fi-instrumentation-otel for standalone tracing across your whole stack.

Quick Example

from fi.evals.otel import setup_tracing, instrument_all, enable_auto_enrichment

# 1. Set up tracing
setup_tracing(service_name="my-app", otlp_endpoint="http://localhost:4317")

# 2. Auto-instrument all supported LLM libraries
instrumented = instrument_all()
print(f"Instrumented: {instrumented}")  # ["openai", "anthropic"]

# 3. Enable auto-enrichment — scores automatically attach to spans
enable_auto_enrichment()

# Now all OpenAI/Anthropic calls are traced, costs calculated,
# and scores are attached to the active span

Setup

Basic

from fi.evals.otel import setup_tracing

setup_tracing(service_name="my-app")  # exports to console by default

With OTLP endpoint

setup_tracing(
    service_name="my-app",
    otlp_endpoint="http://localhost:4317",
)

With TraceConfig

from fi.evals.otel import setup_tracing, TraceConfig

# Development — console output, all content captured
config = TraceConfig.development("my-app")

# Production — OTLP export, 10% sampling, cost alerts
config = TraceConfig.production(
    service_name="my-app",
    otlp_endpoint="https://otel-collector.internal:4317",
    service_version="2.1.0",
    eval_sample_rate=0.1,
)

# Multi-backend — export to multiple destinations
config = TraceConfig.multi_backend(
    service_name="my-app",
    backends=[
        {"type": "jaeger", "endpoint": "localhost:6831"},
        {"type": "datadog"},
    ],
)

setup_tracing(config=config)

Tracer utilities

from fi.evals.otel import get_tracer, get_current_span, is_tracing_enabled, shutdown_tracing

tracer = get_tracer("my-module")
span = get_current_span()
enabled = is_tracing_enabled()
shutdown_tracing()  # flush and shutdown

Auto-Instrumentation

Instrument LLM libraries with one call. Currently supports OpenAI and Anthropic.

from fi.evals.otel import instrument_all, uninstrument_all, instrument, uninstrument

# Instrument everything available
libraries = instrument_all()  # ["openai", "anthropic"]

# Or instrument individually
instrument("openai", capture_prompts=True, capture_completions=True, capture_streaming=True)
instrument("anthropic")

# Check status
from fi.evals.otel import is_instrumented, get_instrumented_libraries
print(is_instrumented("openai"))       # True
print(get_instrumented_libraries())    # ["openai", "anthropic"]

# Remove instrumentation
uninstrument("openai")
uninstrument_all()

Tracing LLM calls manually

from fi.evals.otel import trace_llm_call

with trace_llm_call("chat", model="gpt-4o", system="openai") as span:
    response = client.chat.completions.create(...)
    span.set_attribute("gen_ai.usage.input_tokens", response.usage.prompt_tokens)
    span.set_attribute("gen_ai.usage.output_tokens", response.usage.completion_tokens)

Auto-Enrichment

When enabled, scoring calls automatically attach their results to the current active span.

from fi.evals.otel import enable_auto_enrichment, disable_auto_enrichment, is_auto_enrichment_enabled
from fi.evals import evaluate as run_eval

enable_auto_enrichment()

# This call automatically enriches the current span
result = run_eval("toxicity", output="Hello world", model="turing_flash")
# The span now has: eval.toxicity = 1.0

disable_auto_enrichment()

Manual enrichment

from fi.evals.otel import enrich_span_with_evaluation, enrich_span_with_eval_result, enrich_span_with_batch_result

# By metric name + score
enrich_span_with_evaluation("toxicity", score=0.95, reason="Safe content")

# From a result object
enrich_span_with_eval_result(result)

# From a batch result
count = enrich_span_with_batch_result(results)  # returns number attached

Span context for scoring

Create a child span specifically for scoring:

from fi.evals.otel import EvaluationSpanContext

with EvaluationSpanContext("my_check") as ctx:
    score = run_my_custom_check(...)
    ctx.record_result(score=score, reason="explanation")

Cost Tracking

Automatically calculate token costs for every LLM call.

from fi.evals.otel import CostSpanProcessor, calculate_cost, DEFAULT_PRICING, TokenPricing

# Add cost tracking to your pipeline
processor = CostSpanProcessor(
    alert_threshold_usd=10.0,
    on_cost_alert=lambda total_cost, span_id: print(f"ALERT: cost ${total_cost:.2f} on span {span_id}"),
)

# Or calculate costs manually
costs = calculate_cost("gpt-4o", input_tokens=1000, output_tokens=500)
print(costs)  # {"input_cost": 0.005, "output_cost": 0.0075, "total_cost": 0.0125}

# Get running totals
print(processor.total_cost_usd)
print(processor.get_summary())

# Add custom pricing
processor.add_custom_pricing("my-model", TokenPricing(
    model="my-model",
    input_per_1k=0.001,
    output_per_1k=0.002,
))

Built-in pricing

DEFAULT_PRICING includes 30+ models: OpenAI (gpt-4o, gpt-4o-mini, o1), Anthropic (claude-3.5-sonnet, claude-3-opus), Google (gemini-1.5-pro, gemini-2.0-flash), Mistral, Cohere, Meta, and embeddings.
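The arithmetic behind per-call cost tracking is plain per-1k-token pricing. A minimal sketch of that calculation — the `PRICING` dict and `estimate_cost` helper here are illustrative stand-ins, not the SDK's `DEFAULT_PRICING` or `calculate_cost`:

```python
# Illustrative per-1k-token rates; the SDK's DEFAULT_PRICING is the
# source of truth for real models.
PRICING = {
    "gpt-4o": {"input_per_1k": 0.005, "output_per_1k": 0.015},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> dict:
    """Compute input/output/total cost in USD from per-1k-token rates."""
    rates = PRICING[model]
    input_cost = input_tokens / 1000 * rates["input_per_1k"]
    output_cost = output_tokens / 1000 * rates["output_per_1k"]
    return {
        "input_cost": input_cost,
        "output_cost": output_cost,
        "total_cost": input_cost + output_cost,
    }
```

With these illustrative rates, 1,000 input and 500 output tokens come to $0.005 + $0.0075 = $0.0125, the same shape of result the `calculate_cost` example returns.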

Span Processors

Custom processors that run on every span.

LLMSpanProcessor

Extracts and normalizes LLM attributes from spans.

from fi.evals.otel import LLMSpanProcessor

processor = LLMSpanProcessor(
    capture_prompts=True,
    capture_completions=True,
    max_content_length=10000,
    redact_patterns=[r"\b\d{3}-\d{2}-\d{4}\b"],  # redact SSNs
)
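Conceptually, `redact_patterns` applies each regex to prompt/completion text before it is written to span attributes. A sketch of that idea — the `redact` helper and `[REDACTED]` replacement token are assumptions for illustration, not the processor's actual internals:

```python
import re

# Same pattern as passed to LLMSpanProcessor(redact_patterns=[...]).
REDACT_PATTERNS = [r"\b\d{3}-\d{2}-\d{4}\b"]  # US SSNs

def redact(text: str, patterns=REDACT_PATTERNS, replacement="[REDACTED]") -> str:
    """Replace every match of each pattern before the text reaches a span."""
    for pattern in patterns:
        text = re.sub(pattern, replacement, text)
    return text

print(redact("My SSN is 123-45-6789."))
# My SSN is [REDACTED].
```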

EvaluationSpanProcessor

Runs scoring on LLM spans automatically.

from fi.evals.otel import EvaluationSpanProcessor

processor = EvaluationSpanProcessor(
    metrics=["relevance", "coherence"],
    sample_rate=0.1,          # score 10% of spans
    async_evaluation=True,    # don't block the span
    cache_enabled=True,
    evaluator_model="turing_flash",
)

BatchEvaluationProcessor

Collects spans into batches so scoring runs per batch rather than per span.

from fi.evals.otel import BatchEvaluationProcessor

processor = BatchEvaluationProcessor(
    metrics=["toxicity"],
    batch_size=10,
    batch_timeout_ms=1000,
)

FilteringSpanProcessor

Runs a delegate processor only on spans that match a filter function.

from fi.evals.otel import FilteringSpanProcessor, CostSpanProcessor

cost_processor = CostSpanProcessor()
filtered = FilteringSpanProcessor(
    filter_fn=lambda span: "gpt-4" in str(span.attributes.get("gen_ai.request.model", "")),
    delegate=cost_processor,
)

CompositeSpanProcessor

Chain multiple processors together.

from fi.evals.otel import CompositeSpanProcessor, LLMSpanProcessor, CostSpanProcessor

composite = CompositeSpanProcessor(
    processors=[LLMSpanProcessor(), CostSpanProcessor()],
    parallel=True,
)

Exporter Backends

Export traces to any OTel-compatible backend.

| Backend | ExporterType | Default Endpoint |
|---|---|---|
| OTLP (gRPC) | OTLP_GRPC | localhost:4317 |
| OTLP (HTTP) | OTLP_HTTP | localhost:4318 |
| Jaeger | JAEGER | localhost:14268 |
| Zipkin | ZIPKIN | localhost:9411 |
| Console | CONSOLE | stdout |
| Datadog | DATADOG | Datadog agent |
| Honeycomb | HONEYCOMB | Honeycomb API |
| New Relic | NEWRELIC | New Relic API |
| Arize | ARIZE | Arize Phoenix |
| Langfuse | LANGFUSE | Langfuse API |
| Phoenix | PHOENIX | Arize Phoenix |
| Future AGI | FUTUREAGI | Future AGI API |
| Custom | CUSTOM | Your endpoint |

from fi.evals.otel import TraceConfig, ExporterConfig, ExporterType, get_exporter_preset

# Use a preset
config = TraceConfig(exporters=[get_exporter_preset("jaeger")])

# Or configure manually
config = TraceConfig(exporters=[
    ExporterConfig(type=ExporterType.OTLP_GRPC, endpoint="http://localhost:4317"),
    ExporterConfig(type=ExporterType.CONSOLE),  # also log to console
])

Configuration Reference

TraceConfig

| Field | Type | Default | Description |
|---|---|---|---|
| service_name | str | "llm-service" | Service name in traces |
| exporters | list | [CONSOLE] | Where to send traces |
| processors | list | [] | Span processors to run |
| sampling_strategy | SamplingStrategy | ALWAYS_ON | ALWAYS_ON, ALWAYS_OFF, RATIO, PARENT_BASED |
| evaluation | EvaluationConfig or None | None | Auto-scoring settings |
| cost | CostConfig or None | None | Cost tracking settings |
| content | ContentConfig or None | None | Content capture/redaction |
| resource | ResourceConfig or None | None | Service metadata |
| enabled | bool | True | Master switch |
| debug | bool | False | Debug logging |

ContentConfig

| Field | Type | Default | Description |
|---|---|---|---|
| capture_prompts | bool | True | Capture input messages |
| capture_completions | bool | True | Capture output messages |
| max_content_length | int | 10000 | Truncate content beyond this |
| redact_patterns | list | [] | Regex patterns to redact |
| redact_pii | bool | False | Auto-redact PII |
| pii_types | list | ["email", "phone", "ssn"] | PII types to redact |

CostConfig

| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | True | Enable cost tracking |
| pricing_source | str | "litellm" | Where to get pricing |
| custom_pricing | dict | {} | Custom model pricing |
| currency | str | "USD" | Currency for costs |
| alert_threshold_usd | float or None | None | Alert when cost exceeds |
| alert_callback | callable or None | None | Called on threshold |

EvaluationConfig

| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | True | Enable auto-scoring on spans |
| metrics | list | ["relevance", "coherence"] | Metrics to run |
| sample_rate | float | 1.0 | Fraction of spans to score |
| async_evaluation | bool | True | Non-blocking scoring |
| timeout_ms | int | 5000 | Timeout per scoring call |
| cache_enabled | bool | True | Cache results |
| cache_ttl_seconds | int | 3600 | Cache TTL |
| evaluator_model | str or None | None | Model for cloud scoring |
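`sample_rate` trades scoring cost against coverage. The usual implementation is a per-span coin flip, sketched below — the `should_evaluate` helper is an illustration of the idea, not the SDK's actual sampler:

```python
import random

def should_evaluate(sample_rate: float, rng=random) -> bool:
    """Decide per span whether to run scoring; 1.0 scores everything."""
    return rng.random() < sample_rate

# Boundary behavior: 0.0 never scores, 1.0 always does.
print(should_evaluate(0.0), should_evaluate(1.0))
# False True
```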

Exporting to Future AGI

from fi.evals.otel import setup_tracing, TraceConfig, ExporterConfig, ExporterType

config = TraceConfig(
    service_name="my-app",
    exporters=[ExporterConfig(type=ExporterType.FUTUREAGI)],
)

# Reads FI_API_KEY, FI_SECRET_KEY, FI_PROJECT_NAME from environment
setup_tracing(config=config)

Environment Variables

| Variable | Purpose | Used by |
|---|---|---|
| OTEL_SERVICE_NAME | Service name in traces | setup_tracing() |
| OTEL_EXPORTER_OTLP_ENDPOINT | OTLP exporter endpoint | OTLP exporters |
| OTEL_EXPORTER_OTLP_HEADERS | OTLP exporter headers | OTLP exporters |
| OTEL_DEPLOYMENT_ENVIRONMENT | Deployment environment label | ResourceConfig |
| FI_API_KEY | Future AGI API key | FutureAGI exporter |
| FI_SECRET_KEY | Future AGI secret key | FutureAGI exporter |
| FI_BASE_URL | Future AGI API endpoint | FutureAGI exporter |
| FI_PROJECT_NAME | Project name | FutureAGI exporter |

Tip

If OpenTelemetry packages are not installed, the module degrades gracefully — all functions become no-ops. Your code won’t crash, tracing just silently disables itself.
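This kind of graceful degradation typically uses the optional-dependency pattern: guard the import, then branch to no-ops. A sketch of the pattern under that assumption — not the module's actual implementation:

```python
# Guard the optional dependency at import time.
try:
    from opentelemetry import trace  # real tracer if OTel is installed
    _OTEL_AVAILABLE = True
except ImportError:
    trace = None
    _OTEL_AVAILABLE = False

def is_tracing_enabled() -> bool:
    """Report whether the OTel packages were importable."""
    return _OTEL_AVAILABLE

def get_current_span():
    """Return the active span, or None when OTel is absent (a no-op)."""
    if not _OTEL_AVAILABLE:
        return None
    return trace.get_current_span()
```

Callers can use the same API either way; without OTel installed, every call simply returns a harmless default instead of raising.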

Semantic Conventions

Standard attribute names for LLM traces.

from fi.evals.otel import GenAIAttributes, CostAttributes, EvaluationAttributes, RAGAttributes

# LLM attributes
GenAIAttributes.PROVIDER_NAME       # "gen_ai.provider.name" (preferred)
GenAIAttributes.SYSTEM              # "gen_ai.system" (deprecated, use PROVIDER_NAME)
GenAIAttributes.REQUEST_MODEL       # "gen_ai.request.model"
GenAIAttributes.USAGE_INPUT_TOKENS  # "gen_ai.usage.input_tokens"
GenAIAttributes.USAGE_OUTPUT_TOKENS # "gen_ai.usage.output_tokens"

# Cost attributes
CostAttributes.TOTAL     # "gen_ai.cost.total"
CostAttributes.INPUT     # "gen_ai.cost.input"

# Scoring attributes
EvaluationAttributes.NAME           # "gen_ai.evaluation.name"
EvaluationAttributes.SCORE_VALUE    # "gen_ai.evaluation.score.value"
EvaluationAttributes.EXPLANATION    # "gen_ai.evaluation.explanation"
# Legacy (still works): EvaluationAttributes.score("toxicity") → "eval.toxicity"

# RAG attributes (indexed)
RAGAttributes.NUM_DOCUMENTS          # "rag.num_documents"
RAGAttributes.document_content(0)    # "rag.document.0.content"
RAGAttributes.document_score(0)      # "rag.document.0.score"

Helper functions

from fi.evals.otel import normalize_system_name, create_llm_span_attributes, create_evaluation_attributes

system = normalize_system_name("OpenAI")  # "openai"

attrs = create_llm_span_attributes(
    system="openai", model="gpt-4o",
    input_tokens=100, output_tokens=50,
)

eval_attrs = create_evaluation_attributes(
    metric="toxicity", score=0.95, reason="Safe",
)