The Problem
You are serving LLM responses via streaming (SSE or WebSocket). The LLM starts generating a helpful response… then suddenly veers into toxic, harmful, or off-topic territory. You cannot wait for the full response — by then, the user has already read the toxic content. You need to monitor the stream token-by-token and cut it off the moment things go wrong.

What You Will Learn
- How to create a `StreamingEvaluator` with custom scoring functions
- How to track toxicity, coherence, and topic coverage as tokens arrive
- How to set early-stop policies that kill the stream on threshold breaches
- How to register callbacks for real-time alerting
- How to use `evaluate_stream()` for one-shot processing
Prerequisites
Define Scoring Functions
The `StreamingEvaluator` accepts any function with the signature `(chunk: str, full_text: str) -> float`. Here are three examples:
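The scorers below are a minimal, self-contained sketch of that signature. The blocklist, topic keywords, and heuristics are illustrative assumptions for demonstration, not part of the library:

```python
# Illustrative scorers matching the (chunk: str, full_text: str) -> float
# signature from the text. The word lists and heuristics below are
# assumptions for demonstration only.

TOXIC_WORDS = {"hate", "stupid", "idiot"}             # hypothetical blocklist
TOPIC_KEYWORDS = {"bread", "dough", "bake", "yeast"}  # expected topic terms

def toxicity_score(chunk: str, full_text: str) -> float:
    """Fraction of words so far that appear on the blocklist (higher is worse)."""
    words = [w.strip(".,!?").lower() for w in full_text.split()]
    return sum(w in TOXIC_WORDS for w in words) / max(len(words), 1)

def coherence_score(chunk: str, full_text: str) -> float:
    """Crude repetition check: ratio of unique words to total words."""
    words = full_text.lower().split()
    return len(set(words)) / max(len(words), 1)

def topic_score(chunk: str, full_text: str) -> float:
    """Fraction of expected topic keywords mentioned so far."""
    text = full_text.lower()
    return sum(k in text for k in TOPIC_KEYWORDS) / len(TOPIC_KEYWORDS)
```

Note that each scorer receives both the newest chunk and the accumulated text, so it can score incrementally or over the whole response so far.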
Scenario 1: Normal Response Completes Successfully
Create a monitor with toxicity and coherence checks. The stream completes without issues.

Scenario 2: Toxic Turn — Stream Gets Cut
The response starts fine, then turns toxic. The `EarlyStopPolicy.strict()` policy kills the stream immediately on the first violation.
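Since the monitor's internals depend on the library, here is a self-contained sketch of the logic a strict policy implements: cut the stream on the first chunk whose score breaches the threshold, and fire a violation callback. The function names, the threshold, and the callback signature are all assumptions, not the library's API:

```python
# Sketch of strict early-stop behavior: stop on the first threshold breach
# and notify a callback. All names here are illustrative assumptions.

from typing import Callable, Iterator, List, Tuple

def monitor_stream(
    chunks: Iterator[str],
    score: Callable[[str, str], float],
    threshold: float,
    on_violation: Callable[[str, float], None] = lambda chunk, s: None,
) -> Tuple[str, bool]:
    """Accumulate chunks; return (text_so_far, was_cut)."""
    full_text = ""
    for chunk in chunks:
        full_text += chunk
        s = score(chunk, full_text)
        if s > threshold:
            on_violation(chunk, s)   # e.g. log, emit a metric, alert
            return full_text, True   # strict policy: stop immediately
    return full_text, False          # stream completed cleanly

def toxicity(chunk: str, full_text: str) -> float:
    bad = {"hate", "stupid"}  # hypothetical blocklist
    words = [w.strip(".,!?").lower() for w in full_text.split()]
    return sum(w in bad for w in words) / max(len(words), 1)

violations: List[Tuple[str, float]] = []
tokens = iter(["Great question! ", "The answer is 42. ", "You are stupid."])
text, cut = monitor_stream(
    tokens, toxicity, threshold=0.0,
    on_violation=lambda c, s: violations.append((c, s)),
)
# cut is True: the third chunk breached the threshold and the stream stopped.
```

The key property is that the generator is never fully consumed once a violation fires, so no further toxic tokens reach the user.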
Scenario 3: Topic Drift Detection
The response starts on-topic (bread baking) but gradually drifts into physics. The topic score degrades over time.

Scenario 4: Real-Time Alerting with Callbacks
Register callbacks that fire on every chunk violation or emergency stop. Use these for logging, metrics, or alerting.

Scenario 5: One-Shot Stream Check
If you already have a generator, pass it directly to `evaluate_stream()` for a single-call check.
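The real `evaluate_stream()` signature belongs to the library, so the stand-in below only illustrates the one-shot pattern: consume the generator in a single call, score every chunk, and return the final scores. The function name, parameters, and return shape are hypothetical:

```python
# Hypothetical stand-in for a one-shot stream check. It consumes the
# generator, applies every scorer per chunk, and returns the final scores.

from typing import Callable, Dict, Iterator

def evaluate_stream_sketch(
    chunks: Iterator[str],
    scorers: Dict[str, Callable[[str, str], float]],
) -> Dict[str, float]:
    full_text = ""
    scores: Dict[str, float] = {}
    for chunk in chunks:
        full_text += chunk
        for name, fn in scorers.items():
            scores[name] = fn(chunk, full_text)  # keep the latest score
    return scores

def length_score(chunk: str, full_text: str) -> float:
    # Toy scorer: normalized length of the text so far, capped at 1.0.
    return min(len(full_text) / 100.0, 1.0)

result = evaluate_stream_sketch(iter(["Hello ", "world"]), {"length": length_score})
```

This trades the per-chunk control of the monitor for a simpler single call, which is convenient in offline evaluation or tests.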
What to Try Next
You now have real-time safety monitoring. Next, learn how to auto-generate an entire test pipeline from a plain-English description of your application.

Next: AutoEval
Describe your app and get an auto-configured testing pipeline you can export to CI/CD.