Evaluation Using Interface

Input:

  • Required Inputs:
    • output: The output column generated by the model.

Output:

  • Result: Passed / Failed

Interpretation:

  • Passed: The output does not contain toxic language.
  • Failed: The output contains toxic language.

Evaluation Using Python SDK

Click here to learn how to set up evaluation using the Python SDK.

Input:

  • Required Inputs:
    • output: string - The output column generated by the model.

Output:

  • Result: bool - 0/1

Interpretation:

  • 0: The output contains toxic language.
  • 1: The output does not contain toxic language.

The example below runs the Toxicity eval against a single test case:

from fi.testcases import TestCase
from fi.evals.templates import Toxicity

# Instantiate the Toxicity eval template
toxicity_eval = Toxicity()

# The Toxicity eval checks the model-generated text passed in the `output` field
test_case = TestCase(
    output="Hello! Hope you're having a wonderful day!"
)

# `evaluator` is an evaluation client initialized as described in the
# Python SDK setup guide linked above
result = evaluator.evaluate(eval_templates=[toxicity_eval], inputs=[test_case])
toxicity_result = result.eval_results[0].metrics[0].value


What to Do When Toxicity is Detected

If toxicity is detected in a response, the first step is to remove or rephrase the harmful language so the text remains safe and appropriate. Implementing content moderation policies helps prevent toxic language from reaching users by enforcing guidelines for acceptable communication.

Additionally, enhancing toxicity detection mechanisms can improve accuracy, reducing false positives while ensuring that genuinely harmful content is effectively identified and addressed.
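
As a concrete illustration of enforcing such a policy, the sketch below wraps the Toxicity check in a small moderation helper. It assumes an already-initialized evaluator client from the Python SDK (setup covered in the guide linked above); the fallback message is a placeholder for whatever your moderation policy prescribes:

from fi.testcases import TestCase
from fi.evals.templates import Toxicity

toxicity_eval = Toxicity()

def moderate(candidate: str) -> str:
    # Run the Toxicity eval on the candidate response
    test_case = TestCase(output=candidate)
    result = evaluator.evaluate(eval_templates=[toxicity_eval], inputs=[test_case])
    value = result.eval_results[0].metrics[0].value

    # 1 = no toxic language detected, 0 = toxic language detected
    if value == 1:
        return candidate

    # Block or rephrase instead of returning toxic text to the user
    return "Sorry, that response was withheld by the content moderation policy."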


Comparing Toxicity with Similar Evals

  1. Content Moderation: Content Moderation assesses text for overall safety and appropriateness, flagging harmful or offensive content across a broad range of categories. Toxicity Evaluation, in contrast, specifically targets toxic language such as hate speech, threats, or highly inflammatory remarks.
  2. Tone Analysis: Tone Analysis evaluates the emotional tone and sentiment of the text, determining whether it reads as neutral, positive, or negative. It offers insight into how a message may be perceived, whereas Toxicity Evaluation identifies language that is explicitly harmful or offensive, regardless of sentiment.