Replay

Replay real production sessions in a dev environment using chat simulation to debug, iterate, and improve your agent.

What it is

Replay (Observe → Simulate) lets you replay real production conversations captured via Observe and rerun them in a development environment using chat simulation. When something goes wrong in production—a hallucination, tool failure, bad tone, or wrong decision—you can select the exact session or trace from Observe, create a replay session, turn it into a simulation scenario, and run the same conversation end-to-end against your dev agent. Change your agent (prompt, logic, tools) and replay again to verify fixes. This closes the loop between observability and iteration.

Under the hood, the platform creates a replay session (tied to your Observe project), generates a chat agent definition and a graph scenario from the production transcripts, and links them to a run test. You then run that run test via Chat Simulation (UI or SDK). Results appear in the same dashboard so you can compare replayed runs and spot regressions or improvements.

Replay types: session vs trace

  • Session — Replays all traces in a given session_id, ordered by span start time; one multi-turn conversation per session. Use when you want to replay full production conversations as multi-turn chat scenarios.
  • Trace — Replays each selected trace as a separate conversation with one turn (input → output). Use when you want to replay individual calls or single-turn interactions.
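The grouping difference between the two types can be sketched in a few lines. This is a minimal illustration with made-up trace records; the field names (session_id, span_start, input, output) are assumptions for the sketch, not the platform's schema:

```python
from collections import defaultdict

# Hypothetical trace records as Observe might capture them (fields are illustrative).
traces = [
    {"session_id": "s1", "span_start": 2, "input": "And in Paris?", "output": "Rainy."},
    {"session_id": "s1", "span_start": 1, "input": "Weather in Rome?", "output": "Sunny."},
    {"session_id": "s2", "span_start": 1, "input": "Reset my password", "output": "Done."},
]

def build_conversations(traces, replay_type):
    if replay_type == "session":
        # One multi-turn conversation per session_id, ordered by span start time.
        sessions = defaultdict(list)
        for t in traces:
            sessions[t["session_id"]].append(t)
        return [
            [(t["input"], t["output"]) for t in sorted(ts, key=lambda t: t["span_start"])]
            for ts in sessions.values()
        ]
    # replay_type == "trace": each trace becomes its own single-turn conversation.
    return [[(t["input"], t["output"])] for t in traces]

session_convs = build_conversations(traces, "session")  # 2 conversations, s1 has 2 turns
trace_convs = build_conversations(traces, "trace")      # 3 single-turn conversations
```

Note how the session replay reorders the two s1 traces by span start time, so the replayed conversation matches the order the user actually experienced.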

Note

Replay does not require a new integration. It builds on Observe (to capture production sessions/traces) and Chat Simulation (to run the replayed conversations).


Use cases

  • Debug real failures — Reproduce and fix issues from production instead of relying only on synthetic test cases.
  • Reproduce edge cases — Re-run conversations that only happened in production so you can iterate on them safely.
  • Compare before vs after — Change your agent and replay the same session to see how behavior and metrics change.
  • Test fixes safely — Validate prompt, model, or tool changes without impacting live users.
  • Turn failures into regression tests — Save the replayed scenario and add it to CI or regular simulation runs.

Common workflows: Debug a bad production response (replay → fix prompt/logic/tools → replay and compare); turn a failure into a regression test (replay → save the scenario in Simulate → add to regular run tests or CI); compare agent versions (replay the same session against different versions or prompts and compare metrics and transcripts).


How to

You need Observe integrated (so production sessions and traces are in the platform), and FI_API_KEY / FI_SECRET_KEY for the replay and simulation APIs. To run the simulation via the SDK you’ll also need a chat agent callback and any LLM provider keys it uses — see Chat Simulation Using SDK.

The flow is: select production data → create a replay session → generate scenario (agent + scenario from transcripts) → create run test → run simulation → view results and iterate.
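Along this flow, the replay session itself moves through a small set of states. The state names (INIT, GENERATING, COMPLETED) come from the steps below; modeling them as a transition table is an illustrative sketch, not platform code:

```python
# Replay-session lifecycle implied by the steps on this page.
TRANSITIONS = {
    ("INIT", "generate_scenario"): "GENERATING",     # scenario generation starts
    ("GENERATING", "create_run_test"): "COMPLETED",  # run test links the replay session
}

def advance(state: str, event: str) -> str:
    # Unknown events leave the state unchanged.
    return TRANSITIONS.get((state, event), state)
```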

Have Observe capturing production data

With Observe integrated, your production system sends sessions and traces to the platform; they are stored per project. Once that data is there, you can create a replay session from it — no extra setup for replay.

Select sessions or traces and create a replay session

From the Observe experience (e.g. your project’s sessions or traces), choose what to replay: either sessions (full multi-turn conversations by session_id) or traces (individual traces, each treated as one turn). Create a replay session with:

  • project_id — The Observe project that owns the data.
  • replay_type — "session" or "trace".
  • ids — List of session IDs or trace IDs to replay, or set select_all to include all sessions or all traces for the project.

The platform creates a replay session in INIT and returns suggestions (e.g. agent_name, scenario_name, agent_description) and, if you already have replay sessions for this project, an existing agent definition to reuse. You can use these when generating the scenario in the next step.
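As a concrete sketch, a create-replay-session request built from the parameters above might look like the following. The field names follow this page's parameter list, but the payload shape and the validation rule are assumptions, not the platform's documented schema:

```python
# Hypothetical request body for creating a replay session (shape is an assumption).
def validate_replay_request(payload: dict) -> None:
    assert payload.get("project_id"), "project_id is required"
    assert payload.get("replay_type") in {"session", "trace"}
    # Either an explicit ID list or select_all, not neither.
    assert payload.get("ids") or payload.get("select_all"), "pass ids or select_all"

create_replay_payload = {
    "project_id": "obs-project-123",   # the Observe project that owns the data
    "replay_type": "session",          # or "trace" for single-turn replays
    "ids": ["sess_a1", "sess_b2"],     # session IDs (trace IDs when replay_type="trace")
    # "select_all": True,              # alternatively, include everything in the project
}
validate_replay_request(create_replay_payload)
```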

Generate scenario (agent + scenario from transcripts)

On the replay session, trigger Generate scenario. You provide:

  • agent_name, scenario_name (required); agent_description (optional).
  • agent_type — "text" (chat) or "voice"; for replay → chat simulation, use "text".
  • no_of_rows — How many scenario rows to generate from the transcripts (default 20).
  • Optional: personas, custom_columns, graph, generate_graph.

The platform creates or updates an agent definition for the project, creates a graph scenario (source Session Replay) from the production transcripts, and starts the scenario generation workflow. The replay session moves to GENERATING. When the workflow finishes, the scenario is ready to use in a run test.
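Put together, a Generate scenario request assembled from the parameters above could look like this. The parameter names come from the list above; the payload shape itself is an assumption:

```python
# Hypothetical "Generate scenario" request body (shape is an assumption).
generate_payload = {
    "agent_name": "support-bot-replay",        # required
    "scenario_name": "prod-failures-week-12",  # required
    "agent_description": "Replayed production support conversations",  # optional
    "agent_type": "text",  # "text" for replay → chat simulation ("voice" also exists)
    "no_of_rows": 20,      # scenario rows generated from the transcripts (default 20)
    # Optional: "personas", "custom_columns", "graph", "generate_graph"
}

required = {"agent_name", "scenario_name"}
assert required <= generate_payload.keys()
```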

Create a run test and run the simulation

Once the scenario is ready, create a run test that uses the replay session’s agent definition and scenario. When creating the run test, pass replay_session_id so the platform can mark the replay session as COMPLETED and link it to the new run test.

Then run the simulation the same way you run any chat simulation: from the UI (Simulate → Run Simulation, then run the new run test) or via the Chat Simulation SDK (use the run test name and your agent callback). The replayed conversations run against your dev agent; transcripts and evals are stored in the dashboard.
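For the SDK path, the simulator drives your dev agent through a callback, one replayed user turn at a time. The callback signature and stub logic below are assumptions for illustration; see Chat Simulation Using SDK for the real interface:

```python
# Sketch of a chat-agent callback as a simulation SDK might invoke it.
def my_dev_agent(message: str, history: list[tuple[str, str]]) -> str:
    """Return the dev agent's reply for one replayed turn.

    In a real setup this would call your prompt/LLM/tool stack; here it is a stub.
    """
    if "refund" in message.lower():
        return "I can help with that refund. Could you share your order ID?"
    return "Thanks, let me look into that."

# The simulator would call the callback with each replayed user turn, e.g.:
reply = my_dev_agent("I want a refund", history=[])
```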

View results and iterate

Open the run test (or simulation) and inspect the test execution and call executions. You get the same kind of results as for any chat simulation.

Performance metrics (top of the execution view):

  • Chat details — total chats, completed count, completion percentage.
  • System metrics — avg output tokens, avg chat latency (ms), avg turn count, avg CSAT.
  • Evaluation metrics — aggregated eval scores (e.g. ground truth match, task completion) showing how closely the replayed agent matches or improves on the original production behavior.
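To make these aggregates concrete, here is how the chat details and system metrics could be derived from per-chat results. The record fields are illustrative, not the API schema:

```python
# Hypothetical per-chat results from one replayed run test (fields are illustrative).
chats = [
    {"completed": True,  "output_tokens": 120, "latency_ms": 900,  "turns": 4, "csat": 4.5},
    {"completed": True,  "output_tokens": 80,  "latency_ms": 700,  "turns": 3, "csat": 5.0},
    {"completed": False, "output_tokens": 30,  "latency_ms": 1500, "turns": 1, "csat": 2.0},
]

total_chats = len(chats)
completed_count = sum(c["completed"] for c in chats)
completion_pct = 100 * completed_count / total_chats
avg_output_tokens = sum(c["output_tokens"] for c in chats) / total_chats
avg_latency_ms = sum(c["latency_ms"] for c in chats) / total_chats
avg_turns = sum(c["turns"] for c in chats) / total_chats
avg_csat = sum(c["csat"] for c in chats) / total_chats
```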

  • Session list — Each row is one replayed session. Compare CSAT, token usage (total, input, output), and per-eval scores across runs.
  • Single session — Click a session to see the turn-by-turn transcript (and, where available, a diff or comparison to the original production conversation) so you can see exactly where the agent’s responses, tool calls, or decisions changed after your fix.

Update your agent (prompt, logic, tools, or model) and replay again to verify improvements.

