Voice Replay: Debug Voice Agents from Production Calls

Replay real production voice calls from Future AGI Observe in simulation to debug, iterate, and improve your voice agent based on real usage.

What it is

Voice Replay (Observe → Simulate) lets you replay real production voice calls captured via Voice Observability and rerun them in a development environment using voice simulation. When something goes wrong in production (a misunderstood order, a wrong tool call, poor latency, or bad tone), you can select the exact voice trace from Observe, create a replay session, turn it into a simulation scenario, and run a new voice call end-to-end against your dev agent. Then change your agent (prompt, model, voice settings) and replay again to verify the fix. This closes the loop between voice observability and iteration.

Under the hood, the platform extracts the original voice configuration (system prompt, assistant settings, provider config) from the production trace’s raw call log, creates a voice agent definition with a configuration snapshot matching the original call, and generates a graph scenario from the production conversation. You then run the scenario via Voice Simulation (UI or SDK). Results include side-by-side transcript comparison, performance metrics comparison, and audio recording playback for both the baseline and replayed calls.

Note

Voice Replay currently supports Vapi as the primary provider. Retell is supported for transcript comparison, but config extraction during replay setup is optimized for Vapi’s data structure.

Use cases

Debug voice agent failures: Reproduce misunderstood intents, wrong tool calls, or hallucinations from real production calls.
Compare call quality: Replay the same conversation after changing your prompt, model, or voice settings and compare latency, WPM, and talk ratio side by side.
Test provider changes: Switch from one voice provider or model to another and replay the same scenarios to measure impact.
Iterate on voice UX: Improve first messages, interruption handling, or response length by replaying real caller interactions.
Turn failures into regression tests: Save the replayed scenario and add it to regular simulation runs or CI.

How to

You need Voice Observability integrated (so production voice calls are captured with their recordings and transcripts), and FI_API_KEY / FI_SECRET_KEY for the replay and simulation APIs.
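Before calling the replay or simulation APIs, it can help to fail fast when a key is missing. The following is a minimal sketch that assumes the two keys are supplied as environment variables named FI_API_KEY and FI_SECRET_KEY; the helper name missing_credentials is hypothetical and not part of any SDK.

```python
import os

# Key names taken from the prerequisites above; reading them from the
# environment is an assumption about how they are supplied.
REQUIRED_KEYS = ("FI_API_KEY", "FI_SECRET_KEY")


def missing_credentials(env=None):
    """Return the names of required keys that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name)]


missing = missing_credentials()
if missing:
    print("Missing credentials: " + ", ".join(missing))
else:
    print("Credentials found.")
```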

The flow is: select voice traces → create a replay session → generate scenario (agent + scenario from audio/transcripts) → create run test → run voice simulation → compare with baseline and iterate.

Have Voice Observability capturing production calls

With Voice Observability integrated, your production voice calls (via Vapi, Retell, or other supported providers) are captured as traces with conversation-type spans. Each span stores the full call data including transcripts, recordings, and call metrics. See Set Up Voice Observability for integration details.

Select voice traces and create a replay session

From the Observe experience, select the voice traces you want to replay. Create a replay session with:

project_id: The Observe project that owns the voice traces.
replay_type: "trace" (each voice trace is one complete call).
ids: List of trace IDs to replay, or set select_all to include all voice traces.

The platform detects that these are voice traces (by checking for conversation-type spans), extracts the original voice configuration from the raw call log (system prompt, assistant ID, provider, model, phone number), and returns suggestions including agent_type: "voice" and the extracted config.
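The fields above can be assembled into a single request body. This is a sketch only: the field names (project_id, replay_type, ids, select_all) come from this page, but the exact payload shape, the endpoint, and the helper name build_replay_session_payload are assumptions, so check them against the API reference before sending anything.

```python
def build_replay_session_payload(project_id, trace_ids=None, select_all=False):
    """Assemble replay-session fields into one dict (assumed request shape)."""
    if not project_id:
        raise ValueError("project_id is required")
    if not select_all and not trace_ids:
        raise ValueError("pass trace_ids or set select_all=True")

    payload = {
        "project_id": project_id,
        "replay_type": "trace",  # each voice trace is one complete call
    }
    if select_all:
        # Replay every voice trace in the project instead of a fixed list.
        payload["select_all"] = True
    else:
        payload["ids"] = list(trace_ids)
    return payload
```

For example, build_replay_session_payload("my-project", ["trace-1", "trace-2"]) yields a body with an explicit ids list, while select_all=True omits ids entirely. The project and trace identifiers here are placeholders.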

Generate scenario (agent + scenario from audio)

On the replay session, trigger Generate scenario. You provide:

agent_name, scenario_name: Required. agent_description is auto-extracted from the original call’s system prompt.
agent_type: "voice".
no_of_rows: How many scenario rows to generate (default 20).
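A small builder makes the required fields and the default row count explicit. As a sketch, it uses the field names listed above; the payload shape and the helper name build_generate_scenario_payload are assumptions rather than a documented SDK call.

```python
def build_generate_scenario_payload(agent_name, scenario_name, no_of_rows=20):
    """Assemble Generate scenario inputs into one dict (assumed request shape)."""
    if not agent_name or not scenario_name:
        raise ValueError("agent_name and scenario_name are required")
    if no_of_rows < 1:
        raise ValueError("no_of_rows must be at least 1")

    return {
        "agent_name": agent_name,
        "scenario_name": scenario_name,
        "agent_type": "voice",  # voice replay always uses the voice agent type
        "no_of_rows": no_of_rows,  # defaults to 20, matching the documented default
    }
```

agent_description is left out on purpose, since the platform extracts it from the original call’s system prompt.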

The platform:

Creates a voice agent definition with the original provider config (assistant ID, model, voice settings) preserved in the agent version’s configuration snapshot.
Extracts user intents from each trace. If recording URLs are available, the audio is used for intent extraction; if no recordings exist, text transcripts are used as a fallback.
Generates a graph scenario (source Session Replay) with persona, situation, and outcome columns derived from the call data.
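The audio-first, transcript-fallback rule for intent extraction can be pictured as a simple selection. This sketch only illustrates the ordering described above; the dictionary keys recording_url and transcript and the function name choose_intent_source are hypothetical, and the platform’s real data model may differ.

```python
def choose_intent_source(trace):
    """Pick the input used for intent extraction: audio first, then transcript."""
    if trace.get("recording_url"):
        return "audio"
    if trace.get("transcript"):
        return "transcript"
    # Neither a recording nor a transcript is available for this trace.
    return None
```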

The replay session moves to GENERATING. When the workflow finishes, the scenario is ready.
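If you drive this step from a script, you will likely want to wait until the session leaves GENERATING. The sketch below polls a caller-supplied function; fetch_status is a hypothetical callable that you would implement against the replay API, and only the GENERATING status name is taken from this page.

```python
import time


def wait_for_scenario(fetch_status, poll_seconds=5.0, timeout_seconds=600.0,
                      sleep=time.sleep):
    """Poll until the session is no longer GENERATING; return the final status."""
    waited = 0.0
    while True:
        status = fetch_status()
        if status != "GENERATING":
            return status
        if waited >= timeout_seconds:
            raise TimeoutError("replay session is still GENERATING")
        sleep(poll_seconds)
        waited += poll_seconds
```

The sleep parameter is injectable so the loop can be exercised in tests without real delays. The caller decides how to treat whatever status comes back.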

Once generated, you can review the scenario rows with persona, situation, and outcome details.

Map eval variables and start replay

After scenarios are generated, you can optionally map eval variables: connect scenario columns (like expected outcome or situation context) to evaluation metrics so the platform can automatically score each replayed call. You can also add further evaluations after the replay.

Then click Start Replay to create a run test linked to the replay session.

Run the voice simulation

Once the run test is created, run the voice simulation. The platform calls the voice provider using the preserved configuration snapshot, so the replayed call uses the same assistant settings, model, and voice as the original production call. Each scenario row generates a new voice call.

Compare with baseline and iterate

After the simulation completes, open a call execution and click Compare with baseline call to see a side-by-side comparison:

Performance metrics: Call Duration, Turn Count, Avg Agent Latency (ms), User WPM, Bot WPM, and Talk Ratio, each showing the value, absolute change, and percentage change from the baseline call.
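To reproduce the value, absolute change, and percentage change columns on exported numbers, one conventional calculation is shown below. It is a sketch: the platform’s exact formula is not specified here, the function names are hypothetical, and percentage change is reported as None when the baseline is zero.

```python
def compare_metric(baseline, replayed):
    """Return the replayed value with absolute and percentage change vs. baseline."""
    absolute = replayed - baseline
    # Percentage change is undefined when the baseline value is zero.
    percent = None if baseline == 0 else absolute / baseline * 100.0
    return {
        "value": replayed,
        "absolute_change": absolute,
        "percent_change": percent,
    }


def compare_metrics(baseline, replayed):
    """Compare every metric that appears in both calls."""
    return {
        name: compare_metric(baseline[name], replayed[name])
        for name in baseline
        if name in replayed
    }
```

For instance, an average agent latency of 800 ms in the baseline and 600 ms in the replay gives an absolute change of -200 and a percentage change of -25.0. These numbers are illustrative, not measured.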

Audio recordings: Play back both the baseline and replayed call recordings (stereo, mono combined, mono customer, mono assistant) directly in the UI.

Transcript comparison: Side-by-side transcripts of the baseline call and the replayed call. Use Show Diff to highlight differences between the two conversations.
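If you export both transcripts, a similar comparison can be made locally with Python’s standard difflib module. This is not how Show Diff is implemented; it is a rough offline equivalent that assumes each transcript is a list of turn strings, and the function name transcript_diff is hypothetical.

```python
import difflib


def transcript_diff(baseline_turns, replayed_turns):
    """Return unified-diff lines between two transcripts (lists of turn strings)."""
    return list(
        difflib.unified_diff(
            baseline_turns,
            replayed_turns,
            fromfile="baseline",
            tofile="replayed",
            lineterm="",  # turns carry no trailing newline, so add none to headers
        )
    )
```

Lines starting with "-" appear only in the baseline call and lines starting with "+" appear only in the replayed call. Identical transcripts produce an empty list.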

Update your agent (prompt, model, voice settings, or tools) and replay again to verify improvements.

Note

The Compare with baseline call button only appears for call executions that originated from a replay session (where a baseline trace exists to compare against).

What you can do next

Chat Replay

Voice Observability

Scenarios

Agent Definition