Monitoring & Alerts: Track LLM Performance and Set Quality Thresholds

Generate rich trace data from a multi-step RAG agent, analyze historical performance trends in the Charts tab, and configure alerts with thresholds and notifications.

📝
TL;DR

Instrument a multi-step RAG agent, explore latency/token/cost trends in Charts, and configure alerts with warning and critical thresholds that notify via email or Slack.

Open in Colab · GitHub

| Time | Difficulty | Package |
|---|---|---|
| 15 min | Intermediate | fi-instrumentation-otel |
Prerequisites

Install

pip install fi-instrumentation-otel traceai-openai openai
export FI_API_KEY="your-api-key"
export FI_SECRET_KEY="your-secret-key"
export OPENAI_API_KEY="your-openai-api-key"

Tutorial

Build and instrument a multi-step RAG agent

Set up tracing and build an agent with distinct tool, chain, and agent spans. This creates the nested span trees and varied metrics (latency, tokens, cost) that make Charts and Alerts useful.

import os
import time
from openai import OpenAI
from fi_instrumentation import register, FITracer, using_user, using_session, using_metadata, using_tags
from fi_instrumentation.fi_types import ProjectType
from traceai_openai import OpenAIInstrumentor

# 1. Register tracing
trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="monitoring-demo",
)
OpenAIInstrumentor().instrument(tracer_provider=trace_provider)

client = OpenAI()
tracer = FITracer(trace_provider.get_tracer(__name__))

# 2. Define agent components using decorators

@tracer.tool(name="search_knowledge_base", description="Search product docs for relevant passages")
def search_knowledge_base(query: str) -> list[str]:
    """Simulates a vector DB search over product documentation."""
    knowledge = {
        "return": ["Items can be returned within 30 days.", "Refunds are processed in 5-7 business days."],
        "shipping": ["Standard shipping takes 5-7 days.", "Express shipping is 1-2 business days.", "Free shipping on orders over $50."],
        "warranty": ["All electronics have a 1-year warranty.", "Extended warranty available for $29.99."],
        "pricing": ["Pro plan is $49/month.", "Enterprise plan is $199/month.", "Annual billing saves 20%."],
        "account": ["Reset password via Settings → Security.", "Two-factor authentication is recommended."],
    }
    results = []
    for key, docs in knowledge.items():
        if key in query.lower():
            results.extend(docs)
    if not results:
        results = ["Please visit our help center at help.example.com for more information."]
    return results


@tracer.chain(name="generate_response")
def generate_response(query: str, context_docs: list[str]) -> str:
    """Uses retrieved context to generate a grounded answer."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a helpful support agent. Answer using ONLY the provided context. "
                    "If the context does not contain the answer, say so.\n\n"
                    f"Context:\n{context}"
                ),
            },
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content


@tracer.agent(name="support_rag_agent")
def support_rag_agent(query: str) -> str:
    """Top-level agent: retrieves docs then generates a grounded response."""
    docs = search_knowledge_base(query)
    answer = generate_response(query, docs)
    return answer

The @tracer.agent, @tracer.tool, and @tracer.chain decorators automatically capture function inputs/outputs and set fi.span_kind attributes on each span. This creates a span tree rooted at support_rag_agent (AGENT), with search_knowledge_base (TOOL) and generate_response (CHAIN) as its children, and the OpenAI LLM span nested under the chain.
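To build intuition for what the decorators produce, the resulting trace can be pictured as a nested structure. The dict layout below is purely illustrative (real spans carry many more attributes and a different wire format); only the fi.span_kind values come from the decorators above.

```python
# Illustrative sketch of the span tree from one agent call -- not the real
# span schema, just a mental model of parent/child relationships.
span_tree = {
    "name": "support_rag_agent",
    "fi.span_kind": "AGENT",
    "children": [
        # The TOOL and CHAIN spans are siblings: both are called directly
        # by the agent function.
        {"name": "search_knowledge_base", "fi.span_kind": "TOOL", "children": []},
        {
            "name": "generate_response",
            "fi.span_kind": "CHAIN",
            "children": [
                # The auto-instrumented OpenAI call nests under the chain.
                {"name": "ChatCompletion", "fi.span_kind": "LLM", "children": []},
            ],
        },
    ],
}

def span_kinds(node: dict) -> list[str]:
    """Depth-first list of span kinds in the tree."""
    kinds = [node["fi.span_kind"]]
    for child in node["children"]:
        kinds.extend(span_kinds(child))
    return kinds

print(span_kinds(span_tree))  # ['AGENT', 'TOOL', 'CHAIN', 'LLM']
```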

Generate trace data across multiple queries

Run the agent in a loop with varied queries, users, and sessions to produce enough data points for meaningful charts and alert thresholds.

# Diverse queries that exercise different knowledge base paths
test_queries = [
    "What is your return policy?",
    "How long does shipping take?",
    "Do you offer express shipping?",
    "What warranty comes with electronics?",
    "How much is the Pro plan?",
    "Can I get a discount on annual billing?",
    "How do I reset my password?",
    "What is the refund timeline?",
    "Is there free shipping?",
    "Tell me about the extended warranty.",
]

users = ["user-alice", "user-bob", "user-carol", "user-dave", "user-eve"]
environments = ["production", "staging"]

print("Generating trace data...\n")
for i, query in enumerate(test_queries):
    user_id = users[i % len(users)]
    session_id = f"session-{user_id}-{i // len(users)}"
    env_tag = environments[i % len(environments)]

    with (
        using_user(user_id),
        using_session(session_id),
        using_metadata({"environment": env_tag, "query_index": str(i)}),
        using_tags([env_tag, "rag-pipeline", "monitoring-demo"]),
    ):
        answer = support_rag_agent(query)
        print(f"[{user_id}] Q: {query}")
        print(f"         A: {answer[:80]}...\n")

    # Small delay between queries to spread data points over time
    time.sleep(0.5)

trace_provider.force_flush()
print("All traces flushed. Data is now available in Tracing.")

Expected output:

Generating trace data...

[user-alice] Q: What is your return policy?
         A: Items can be returned within 30 days of purchase. Refunds are processed in...

[user-bob] Q: How long does shipping take?
         A: Standard shipping takes 5-7 business days. Express shipping is available fo...

[user-carol] Q: Do you offer express shipping?
         A: Yes, express shipping is available and takes 1-2 business days...

...

All traces flushed. Data is now available in Tracing.

Wait 1-2 minutes for the traces to appear in the dashboard before proceeding.

Tip

For more realistic alerting scenarios, run this script multiple times across different hours or days. Alerts evaluate metrics over time windows, so more data spread over time produces better threshold previews.

Analyze historical trends in the Charts tab

Go to app.futureagi.com → Tracing (left sidebar under OBSERVE) → select your project (monitoring-demo) → click the Charts tab (4th tab, after LLM Tracing, Sessions, and Documents).

The Charts tab shows system-level performance metrics over time:

| Chart | What it shows |
|---|---|
| Latency | Average response time in milliseconds across all spans |
| Tokens | Total token consumption (input + output) summed across spans |
| Traffic | Total span count — how many operations your agent executed |
| Cost | Average cost per span in dollars |

If you have evaluation metrics configured on this project — via Inline Evals in Tracing — additional charts appear below the system metrics, one per evaluation metric.

Controls

  • Date range — select from presets (Today, Yesterday, 7D, 30D, 3M, 6M, 12M) or a custom range
  • Interval — the dropdown on the right groups data by Hour, Day, Week, or Month. Hour is disabled for ranges longer than 7 days; Month is disabled for ranges shorter than 90 days
  • Zoom — click and drag on any chart to zoom in. All four system metric charts sync to the same zoomed range
  • Refresh — re-fetch all chart data
  • View Traces — jump to the LLM Tracing tab with the same date filter applied
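The interval rules above can be sketched as a small helper. The day cutoffs come from the text; the function itself is illustrative and not part of any SDK.

```python
def available_intervals(range_days: int) -> list[str]:
    """Which grouping intervals the Charts interval dropdown enables for a
    given date-range length (illustrative helper, not an SDK function)."""
    intervals = ["Hour", "Day", "Week", "Month"]
    if range_days > 7:    # Hour is disabled for ranges longer than 7 days
        intervals.remove("Hour")
    if range_days < 90:   # Month is disabled for ranges shorter than 90 days
        intervals.remove("Month")
    return intervals

print(available_intervals(7))    # ['Hour', 'Day', 'Week']
print(available_intervals(30))   # ['Day', 'Week']
print(available_intervals(180))  # ['Day', 'Week', 'Month']
```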

Tip

Use the Charts tab as a daily health check. A sudden spike in Latency or drop in Traffic often signals an upstream provider issue before your users notice.

Create an alert

Go to app.futureagi.com → Tracing (left sidebar under OBSERVE) → select your project (monitoring-demo) → click the Alerts tab (5th tab, after Charts).

Click Create Alerts to open the alert creation drawer.

4a. Select alert type

The first tab shows two categories:

Application Performance alerts:

| Alert type | What it monitors |
|---|---|
| Count of errors | Total error count across spans |
| Span response time | End-to-end latency of spans |
| LLM response time | Latency of LLM-specific spans |
| LLM API failure rates | Percentage of failed LLM API calls |
| Error rates for function calling | Failure rate of tool/function call spans |
| Error free session rates | Percentage of sessions with zero errors |
| Service provider error rates | Errors grouped by LLM provider |

Metric Alerts:

| Alert type | What it monitors |
|---|---|
| Evaluation metrics | Scores from inline evals attached to traces |
| Token usage | Token consumption per span |
| Daily tokens spent | Aggregate daily token usage |
| Monthly tokens spent | Aggregate monthly token usage |
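As a worked example of one metric from the first table, "Error free session rates" is simply the share of sessions in which every span succeeded. The data below is hypothetical and the function is illustrative.

```python
def error_free_session_rate(sessions: dict[str, list[bool]]) -> float:
    """Percentage of sessions whose spans all succeeded.
    Keys are session IDs; values are per-span success flags (hypothetical data)."""
    clean = sum(1 for spans in sessions.values() if all(spans))
    return 100.0 * clean / len(sessions)

sessions = {
    "session-user-alice-0": [True, True, True],   # no errors
    "session-user-bob-0":   [True, False, True],  # one failed span
    "session-user-carol-0": [True, True],         # no errors
    "session-user-dave-0":  [True, True, True],   # no errors
}
print(error_free_session_rate(sessions))  # 75.0
```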

Select LLM response time under Application Performance, then proceed to the next tab.

4b. Set alert configuration

The second tab has five sections. Fill them in order:

Name — enter High LLM Latency.

Define Metrics & Interval — the metric is pre-filled from your selection (LLM response time). Set the Interval dropdown to 15 minute interval — this is how often the alert evaluates the metric.

Filter Events — optionally click Add Filter to narrow the alert to specific span attributes (e.g., only spans from a certain environment or model). Leave empty for this example.

Define Alert — choose Static Value (alerts when the metric is above or below a fixed number). Then configure the two threshold levels:

  • Critical — set Threshold to Above and Value to 5000. This fires when LLM response time exceeds 5000ms
  • Warning — set Threshold to Above and Value to 2000. This fires when LLM response time exceeds 2000ms

The warning value must be less severe than critical (for “Above” alerts: warning < critical).
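The threshold logic can be sketched as follows. The classification function is illustrative only; the platform evaluates alerts server-side.

```python
def classify(value: float, warning: float, critical: float,
             direction: str = "above") -> str:
    """Classify one metric sample against warning/critical thresholds
    (illustrative sketch of 'Static Value' alert evaluation)."""
    if direction == "above":
        # For 'Above' alerts the warning threshold must sit below critical.
        assert warning < critical, "warning must be below critical"
        if value > critical:
            return "critical"
        if value > warning:
            return "warning"
    else:  # direction == "below"
        assert warning > critical, "warning must be above critical"
        if value < critical:
            return "critical"
        if value < warning:
            return "warning"
    return "ok"

print(classify(1200, warning=2000, critical=5000))  # 'ok'
print(classify(3400, warning=2000, critical=5000))  # 'warning'
print(classify(7100, warning=2000, critical=5000))  # 'critical'
```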

Define Notification — choose Email or Slack:

  • Email — enter up to 5 comma-separated email addresses
  • Slack — paste a Slack webhook URL and optionally add notes (e.g., the channel name)

Tip

To create a Slack webhook URL, go to your Slack workspace settings → Apps → Incoming Webhooks → Add New Webhook. Copy the URL and paste it into the Slack notification field.
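Before wiring the webhook into an alert, it can help to post a test message to confirm it works. Incoming webhooks accept a JSON body with a `text` field; the URL below is a placeholder you must replace with your own.

```python
import json
import urllib.request

WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def build_slack_payload(text: str) -> bytes:
    """Incoming webhooks accept a JSON body with a 'text' field."""
    return json.dumps({"text": text}).encode("utf-8")

def send_test_message(url: str = WEBHOOK_URL) -> None:
    """POST a test message to verify the webhook before alerts use it."""
    req = urllib.request.Request(
        url,
        data=build_slack_payload(":rotating_light: Test alert from monitoring-demo"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # Slack responds with 'ok' on success

# send_test_message()  # uncomment after pasting your real webhook URL
```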

Monitor and manage alerts

After creating alerts, the Alerts tab shows all alerts for this project in a searchable list. Use the search bar to find alerts by name.

View alert details

Click any alert to see:

  • Configuration — the alert type, thresholds, check frequency, and notification channels
  • Trigger history (logs) — a timeline of every time the alert fired, showing:
    • Alert level: Warning or Critical
    • Message describing what triggered it
    • Timestamp of when it fired
    • Whether it has been resolved
  • Current status — whether the alert is active, in warning state, in critical state, or resolved

Manage alerts

From the alert detail view or the alerts list, you can:

  • Mute/unmute — temporarily silence notifications without deleting the alert. Useful during maintenance windows
  • Edit — change thresholds, check frequency, or notification channels
  • Duplicate — clone an alert to create a similar one with different thresholds (e.g., duplicate the latency alert and change it to monitor token usage)
  • Delete — permanently remove the alert

Tip

Start with a few high-signal alerts — LLM response time, error rates, and daily token spend — rather than alerting on everything. Too many alerts cause notification fatigue and get ignored.

What you built

You can now generate rich trace data from an instrumented agent, analyze performance trends in Charts, and configure alerts with thresholds and notifications.

  • Instrumented a multi-step RAG agent with @tracer.agent, @tracer.tool, and @tracer.chain decorators for rich span trees
  • Generated diverse trace data across multiple users, sessions, and environments using context managers
  • Explored historical performance trends — Latency, Tokens, Traffic, and Cost — in the Charts tab with date range and interval controls
  • Created an LLM response time alert with static warning (2000ms) and critical (5000ms) thresholds
  • Configured email and Slack notifications for threshold breaches
  • Reviewed alert trigger history, mute/unmute controls, and alert management options