Monitoring & Alerts: Track LLM Performance and Set Quality Thresholds

Generate rich trace data from a multi-step RAG agent, analyze historical performance trends in the Charts tab, and configure alerts with thresholds and notifications.

📝
TL;DR

Instrument a multi-step RAG agent, explore latency/token/cost trends in Charts, and configure alerts with warning and critical thresholds that notify via email or Slack.

Open in Colab · GitHub

| Time | Difficulty | Package |
|---|---|---|
| 15 min | Intermediate | fi-instrumentation-otel |
Prerequisites

Install

pip install fi-instrumentation-otel traceai-openai openai
export FI_API_KEY="your-api-key"
export FI_SECRET_KEY="your-secret-key"
export OPENAI_API_KEY="your-openai-api-key"

Tutorial

Build and instrument a multi-step RAG agent

Set up tracing and build an agent with distinct tool, chain, and agent spans. This creates the nested span trees and varied metrics (latency, tokens, cost) that make Charts and Alerts useful.

import os
import time
from openai import OpenAI
from fi_instrumentation import register, FITracer, using_user, using_session, using_metadata, using_tags
from fi_instrumentation.fi_types import ProjectType
from traceai_openai import OpenAIInstrumentor

# 1. Register tracing
trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="monitoring-demo",
)
OpenAIInstrumentor().instrument(tracer_provider=trace_provider)

client = OpenAI()
tracer = FITracer(trace_provider.get_tracer(__name__))

# 2. Define agent components using decorators

@tracer.tool(name="search_knowledge_base", description="Search product docs for relevant passages")
def search_knowledge_base(query: str) -> list[str]:
    """Simulates a vector DB search over product documentation."""
    knowledge = {
        "return": ["Items can be returned within 30 days.", "Refunds are processed in 5-7 business days."],
        "shipping": ["Standard shipping takes 5-7 days.", "Express shipping is 1-2 business days.", "Free shipping on orders over $50."],
        "warranty": ["All electronics have a 1-year warranty.", "Extended warranty available for $29.99."],
        "pricing": ["Pro plan is $49/month.", "Enterprise plan is $199/month.", "Annual billing saves 20%."],
        "account": ["Reset password via Settings → Security.", "Two-factor authentication is recommended."],
    }
    results = []
    for key, docs in knowledge.items():
        if key in query.lower():
            results.extend(docs)
    if not results:
        results = ["Please visit our help center at help.example.com for more information."]
    return results


@tracer.chain(name="generate_response")
def generate_response(query: str, context_docs: list[str]) -> str:
    """Uses retrieved context to generate a grounded answer."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a helpful support agent. Answer using ONLY the provided context. "
                    "If the context does not contain the answer, say so.\n\n"
                    f"Context:\n{context}"
                ),
            },
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content


@tracer.agent(name="support_rag_agent")
def support_rag_agent(query: str) -> str:
    """Top-level agent: retrieves docs then generates a grounded response."""
    docs = search_knowledge_base(query)
    answer = generate_response(query, docs)
    return answer

The @tracer.agent, @tracer.tool, and @tracer.chain decorators automatically capture function inputs/outputs and set fi.span_kind attributes on each span. This creates a span tree rooted at support_rag_agent (AGENT), with search_knowledge_base (TOOL) and generate_response (CHAIN) as its children, and the OpenAI LLM span nested under the chain.
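To build intuition for what the decorators produce, the resulting trace can be pictured as a nested structure. The dict layout below is purely illustrative (real spans carry many more attributes and a different wire format); only the fi.span_kind values come from the decorators above.

```python
# Illustrative sketch of the span tree from one agent call -- not the real
# span schema, just a mental model of parent/child relationships.
span_tree = {
    "name": "support_rag_agent",
    "fi.span_kind": "AGENT",
    "children": [
        # The TOOL and CHAIN spans are siblings: both are called directly
        # by the agent function.
        {"name": "search_knowledge_base", "fi.span_kind": "TOOL", "children": []},
        {
            "name": "generate_response",
            "fi.span_kind": "CHAIN",
            "children": [
                # The auto-instrumented OpenAI call nests under the chain.
                {"name": "ChatCompletion", "fi.span_kind": "LLM", "children": []},
            ],
        },
    ],
}

def span_kinds(node: dict) -> list[str]:
    """Depth-first list of span kinds in the tree."""
    kinds = [node["fi.span_kind"]]
    for child in node["children"]:
        kinds.extend(span_kinds(child))
    return kinds

print(span_kinds(span_tree))  # ['AGENT', 'TOOL', 'CHAIN', 'LLM']
```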

Generate trace data across multiple queries

Run the agent in a loop with varied queries, users, and sessions to produce enough data points for meaningful charts and alert thresholds.

# Diverse queries that exercise different knowledge base paths
test_queries = [
    "What is your return policy?",
    "How long does shipping take?",
    "Do you offer express shipping?",
    "What warranty comes with electronics?",
    "How much is the Pro plan?",
    "Can I get a discount on annual billing?",
    "How do I reset my password?",
    "What is the refund timeline?",
    "Is there free shipping?",
    "Tell me about the extended warranty.",
]

users = ["user-alice", "user-bob", "user-carol", "user-dave", "user-eve"]
environments = ["production", "staging"]

print("Generating trace data...\n")
for i, query in enumerate(test_queries):
    user_id = users[i % len(users)]
    session_id = f"session-{user_id}-{i // len(users)}"
    env_tag = environments[i % len(environments)]

    with (
        using_user(user_id),
        using_session(session_id),
        using_metadata({"environment": env_tag, "query_index": str(i)}),
        using_tags([env_tag, "rag-pipeline", "monitoring-demo"]),
    ):
        answer = support_rag_agent(query)
        print(f"[{user_id}] Q: {query}")
        print(f"         A: {answer[:80]}...\n")

    # Small delay between queries to spread data points over time
    time.sleep(0.5)

trace_provider.force_flush()
print("All traces flushed. Data is now available in Tracing.")

Expected output:

Generating trace data...

[user-alice] Q: What is your return policy?
         A: Items can be returned within 30 days of purchase. Refunds are processed in...

[user-bob] Q: How long does shipping take?
         A: Standard shipping takes 5-7 business days. Express shipping is available fo...

[user-carol] Q: Do you offer express shipping?
         A: Yes, express shipping is available and takes 1-2 business days...

...

All traces flushed. Data is now available in Tracing.

Wait 1-2 minutes for the traces to appear in the dashboard before proceeding.

Tip

For more realistic alerting scenarios, run this script multiple times across different hours or days. Alerts evaluate metrics over time windows, so more data spread over time produces better threshold previews.

Analyze historical trends in the Charts tab

Go to app.futureagi.com → Tracing (left sidebar under OBSERVE) → select your project (monitoring-demo) → click the Charts tab (4th tab, after LLM Tracing, Sessions, and Documents).

The Charts tab shows system-level performance metrics over time:

| Chart | What it shows |
|---|---|
| Latency | Average response time in milliseconds across all spans |
| Tokens | Total token consumption (input + output) summed across spans |
| Traffic | Total span count — how many operations your agent executed |
| Cost | Average cost per span in dollars |

If you have evaluation metrics configured on this project — via Inline Evals in Tracing — additional charts appear below the system metrics, one per evaluation metric.

Controls

  • Date range — select from presets (Today, Yesterday, 7D, 30D, 3M, 6M, 12M) or a custom range
  • Interval — the dropdown on the right groups data by Hour, Day, Week, or Month. Hour is disabled for ranges longer than 7 days; Month is disabled for ranges shorter than 90 days
  • Zoom — click and drag on any chart to zoom in. All four system metric charts sync to the same zoomed range
  • Refresh — re-fetch all chart data
  • View Traces — jump to the LLM Tracing tab with the same date filter applied
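The interval rules above can be sketched as a small helper. The day cutoffs come from the text; the function itself is illustrative and not part of any SDK.

```python
def available_intervals(range_days: int) -> list[str]:
    """Which grouping intervals the Charts interval dropdown enables for a
    given date-range length (illustrative helper, not an SDK function)."""
    intervals = ["Hour", "Day", "Week", "Month"]
    if range_days > 7:    # Hour is disabled for ranges longer than 7 days
        intervals.remove("Hour")
    if range_days < 90:   # Month is disabled for ranges shorter than 90 days
        intervals.remove("Month")
    return intervals

print(available_intervals(7))    # ['Hour', 'Day', 'Week']
print(available_intervals(30))   # ['Day', 'Week']
print(available_intervals(180))  # ['Day', 'Week', 'Month']
```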

Tip

Use the Charts tab as a daily health check. A sudden spike in Latency or drop in Traffic often signals an upstream provider issue before your users notice.

Create an alert

Go to app.futureagi.com → Tracing (left sidebar under OBSERVE) → select your project (monitoring-demo) → click the Alerts tab (5th tab, after Charts).

Click Create Alerts to open the alert creation drawer.

4a. Select alert type

The first tab shows two categories:

Application Performance alerts:

| Alert type | What it monitors |
|---|---|
| Count of errors | Total error count across spans |
| Span response time | End-to-end latency of spans |
| LLM response time | Latency of LLM-specific spans |
| LLM API failure rates | Percentage of failed LLM API calls |
| Error rates for function calling | Failure rate of tool/function call spans |
| Error free session rates | Percentage of sessions with zero errors |
| Service provider error rates | Errors grouped by LLM provider |

Metric Alerts:

| Alert type | What it monitors |
|---|---|
| Evaluation metrics | Scores from inline evals attached to traces |
| Token usage | Token consumption per span |
| Daily tokens spent | Aggregate daily token usage |
| Monthly tokens spent | Aggregate monthly token usage |
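As a worked example of one metric from the first table, "Error free session rates" is simply the share of sessions in which every span succeeded. The data below is hypothetical and the function is illustrative.

```python
def error_free_session_rate(sessions: dict[str, list[bool]]) -> float:
    """Percentage of sessions whose spans all succeeded.
    Keys are session IDs; values are per-span success flags (hypothetical data)."""
    clean = sum(1 for spans in sessions.values() if all(spans))
    return 100.0 * clean / len(sessions)

sessions = {
    "session-user-alice-0": [True, True, True],   # no errors
    "session-user-bob-0":   [True, False, True],  # one failed span
    "session-user-carol-0": [True, True],         # no errors
    "session-user-dave-0":  [True, True, True],   # no errors
}
print(error_free_session_rate(sessions))  # 75.0
```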

Select LLM response time under Application Performance, then proceed to the next tab.

4b. Set alert configuration

The second tab has five sections. Fill them in order:

Name — enter High LLM Latency.

Define Metrics & Interval — the metric is pre-filled from your selection (LLM response time). Set the Interval dropdown to 15 minute interval — this is how often the alert evaluates the metric.

Filter Events — optionally click Add Filter to narrow the alert to specific span attributes (e.g., only spans from a certain environment or model). Leave empty for this example.

Define Alert — choose Static Value (alerts when the metric is above or below a fixed number). Then configure the two threshold levels:

  • Critical — set Threshold to Above and Value to 5000. This fires when LLM response time exceeds 5000ms
  • Warning — set Threshold to Above and Value to 2000. This fires when LLM response time exceeds 2000ms

The warning value must be less severe than critical (for “Above” alerts: warning < critical).
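The threshold logic can be sketched as follows. The classification function is illustrative only; the platform evaluates alerts server-side.

```python
def classify(value: float, warning: float, critical: float,
             direction: str = "above") -> str:
    """Classify one metric sample against warning/critical thresholds
    (illustrative sketch of 'Static Value' alert evaluation)."""
    if direction == "above":
        # For 'Above' alerts the warning threshold must sit below critical.
        assert warning < critical, "warning must be below critical"
        if value > critical:
            return "critical"
        if value > warning:
            return "warning"
    else:  # direction == "below"
        assert warning > critical, "warning must be above critical"
        if value < critical:
            return "critical"
        if value < warning:
            return "warning"
    return "ok"

print(classify(1200, warning=2000, critical=5000))  # 'ok'
print(classify(3400, warning=2000, critical=5000))  # 'warning'
print(classify(7100, warning=2000, critical=5000))  # 'critical'
```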

Define Notification — choose Email or Slack:

  • Email — enter up to 5 comma-separated email addresses
  • Slack — paste a Slack webhook URL and optionally add notes (e.g., the channel name)

Tip

To create a Slack webhook URL, go to your Slack workspace settings → Apps → Incoming Webhooks → Add New Webhook. Copy the URL and paste it into the Slack notification field.
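Before wiring the webhook into an alert, it can help to post a test message to confirm it works. Incoming webhooks accept a JSON body with a `text` field; the URL below is a placeholder you must replace with your own.

```python
import json
import urllib.request

WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def build_slack_payload(text: str) -> bytes:
    """Incoming webhooks accept a JSON body with a 'text' field."""
    return json.dumps({"text": text}).encode("utf-8")

def send_test_message(url: str = WEBHOOK_URL) -> None:
    """POST a test message to verify the webhook before alerts use it."""
    req = urllib.request.Request(
        url,
        data=build_slack_payload(":rotating_light: Test alert from monitoring-demo"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # Slack responds with 'ok' on success

# send_test_message()  # uncomment after pasting your real webhook URL
```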

Monitor and manage alerts

After creating alerts, the Alerts tab shows all alerts for this project in a searchable list. Use the search bar to find alerts by name.

View alert details

Click any alert to see:

  • Configuration — the alert type, thresholds, check frequency, and notification channels
  • Trigger history (logs) — a timeline of every time the alert fired, showing:
    • Alert level: Warning or Critical
    • Message describing what triggered it
    • Timestamp of when it fired
    • Whether it has been resolved
  • Current status — whether the alert is active, in warning state, in critical state, or resolved

Manage alerts

From the alert detail view or the alerts list, you can:

  • Mute/unmute — temporarily silence notifications without deleting the alert. Useful during maintenance windows
  • Edit — change thresholds, check frequency, or notification channels
  • Duplicate — clone an alert to create a similar one with different thresholds (e.g., duplicate the latency alert and change it to monitor token usage)
  • Delete — permanently remove the alert

Tip

Start with a few high-signal alerts — LLM response time, error rates, and daily token spend — rather than alerting on everything. Too many alerts cause notification fatigue and get ignored.

What you built

You can now generate rich trace data from an instrumented agent, analyze performance trends in Charts, and configure alerts with thresholds and notifications.

  • Instrumented a multi-step RAG agent with @tracer.agent, @tracer.tool, and @tracer.chain decorators for rich span trees
  • Generated diverse trace data across multiple users, sessions, and environments using context managers
  • Explored historical performance trends — Latency, Tokens, Traffic, and Cost — in the Charts tab with date range and interval controls
  • Created an LLM response time alert with static warning (2000ms) and critical (5000ms) thresholds
  • Configured email and Slack notifications for threshold breaches
  • Reviewed alert trigger history, mute/unmute controls, and alert management options