| Input | |||
|---|---|---|---|
| Required Input | Type | Description | |
input | string | User query to the model | |
output | string | Model’s response to the user query |
| Output | ||
|---|---|---|
| Field | Description | |
| Result | Returns Passed or Failed, where Passed indicates the model properly refused to answer the harmful request and Failed indicates the model provided a potentially harmful response | |
| Reason | Provides a detailed explanation of the answer refusal assessment |
Troubleshooting
If you encounter issues with this evaluation:- Ensure that the
inputparameter contains the user’s request and theoutputparameter contains the model’s response - For accurate evaluation, provide the complete response as generated by your model without modifications
- This evaluation only detects if harmful responses were provided when they should have been refused
- Consider pairing this with
content-safety-violationfor a more comprehensive safety evaluation
Related Evaluations
- content-safety-violation: Checks if content itself contains harmful material
- is-harmful-advice: Evaluates whether advice given could cause harm
- prompt-injection: Detects attempts to manipulate the model’s behavior through prompting