Content Moderation
Definition
Evaluates content safety using OpenAI’s content moderation system to detect and flag potentially harmful, inappropriate, or unsafe content. This evaluation provides a binary (Pass/Fail) assessment of text content against established safety guidelines.
Calculation
The eval processes text input by submitting it to OpenAI’s moderation endpoint, where it undergoes analysis across multiple safety categories, including violence, sexual content, hate speech, harassment, self-harm, profanity, and illegal activities.
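A minimal sketch of how the endpoint’s response maps to the eval’s verdict. The dict below is a hand-written stand-in shaped like the moderation endpoint’s JSON (the `flagged`, `categories`, and `category_scores` fields exist in the real response; the sample values are illustrative). In practice the response would come from a call such as `client.moderations.create(...)` in the OpenAI Python SDK.

```python
# Hypothetical moderation response, shaped like the JSON returned by
# OpenAI's moderation endpoint (field names are real; values are made up).
sample_response = {
    "results": [
        {
            "flagged": True,
            "categories": {"violence": True, "hate": False, "self-harm": False},
            "category_scores": {"violence": 0.91, "hate": 0.02, "self-harm": 0.01},
        }
    ]
}

def verdict(response: dict) -> str:
    """Map a moderation response to the eval's binary Pass/Fail output."""
    result = response["results"][0]
    return "Fail" if result["flagged"] else "Pass"

print(verdict(sample_response))  # -> Fail
```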
The evaluation follows a three-stage process. First, an Initial Scan tokenises the text, assesses each token against safety categories, and assigns confidence scores. Next, in the Category Analysis phase, content is systematically checked against predefined safety categories, severity levels are determined, and thresholds are applied to identify potential violations. Finally, the Final Assessment aggregates results across categories, applies a binary decision based on established thresholds, and generates an output of either Pass or Fail.
The scoring logic follows a binary approach: content either meets all safety guidelines and passes, or it violates one or more safety categories and fails, with no partial scores assigned.
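The category analysis and binary aggregation above can be sketched as follows. The threshold values and category names here are illustrative assumptions, not OpenAI’s actual internal thresholds:

```python
from typing import Dict

# Illustrative per-category thresholds (hypothetical values, not OpenAI's).
THRESHOLDS: Dict[str, float] = {
    "violence": 0.5,
    "sexual": 0.5,
    "hate": 0.4,
    "harassment": 0.4,
    "self-harm": 0.3,
}

def flag_categories(scores: Dict[str, float]) -> Dict[str, bool]:
    """Category analysis: apply each category's threshold to its confidence score."""
    return {cat: scores.get(cat, 0.0) >= thr for cat, thr in THRESHOLDS.items()}

def final_assessment(scores: Dict[str, float]) -> str:
    """Final assessment: a single flagged category fails the whole text."""
    return "Fail" if any(flag_categories(scores).values()) else "Pass"

print(final_assessment({"violence": 0.05, "hate": 0.1}))  # -> Pass
print(final_assessment({"harassment": 0.72}))             # -> Fail
```

Note how the aggregation is all-or-nothing: there is no weighting or averaging across categories, which matches the binary Pass/Fail output described above.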
What to do when Content Moderation Fails
When content moderation fails, the first step is to analyse the flagged content by identifying which safety categories triggered the failure and reviewing the specific problematic sections. Understanding the context and severity of violations is crucial in determining the appropriate remediation steps.
To address flagged content, modifications may include rewording while preserving meaning, implementing pre-processing safety checks, or adding content filtering before submission. If system adjustments are required, reviewing and refining safety thresholds, implementing category-specific filters, and incorporating additional pre-screening measures can enhance moderation accuracy. For more robust filtering, consider a multi-stage moderation pipeline.
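One way such a multi-stage pipeline might look is sketched below: cheap local pre-screens run first, and the (comparatively expensive) moderation-endpoint call would only run if they all pass. The stage functions and blocklist terms are hypothetical placeholders, not part of any real API:

```python
import re
from typing import Callable, List, Tuple

# Each stage returns (passed, reason); the pipeline short-circuits on failure.
Stage = Callable[[str], Tuple[bool, str]]

def length_stage(text: str) -> Tuple[bool, str]:
    """Pre-screen: reject empty or whitespace-only input before any API call."""
    return (bool(text.strip()), "ok" if text.strip() else "empty input")

def blocklist_stage(text: str) -> Tuple[bool, str]:
    """Pre-screen: reject text containing known-disallowed terms (placeholders)."""
    blocked = {"badwordone", "badwordtwo"}  # hypothetical blocklist
    words = set(re.findall(r"[a-z']+", text.lower()))
    hit = words & blocked
    return (not hit, f"blocklist: {sorted(hit)}" if hit else "ok")

def run_pipeline(text: str, stages: List[Stage]) -> Tuple[str, str]:
    """Run stages in order; the first failing stage fails the whole text."""
    for stage in stages:
        passed, reason = stage(text)
        if not passed:
            return "Fail", reason
    # A final stage would submit the text to the moderation endpoint here.
    return "Pass", "all stages passed"

print(run_pipeline("hello world", [length_stage, blocklist_stage]))
```

Ordering stages from cheapest to most expensive keeps latency and API costs down, since clearly unsafe or malformed input never reaches the endpoint.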
Comparing Content Moderation with Similar Evals
- Safe for Work Text: While Content Moderation provides comprehensive safety analysis, Safe for Work Text specifically focuses on workplace appropriateness. Content Moderation is broader and includes multiple safety categories.
- Not Gibberish Text: Content Moderation focuses on safety aspects, while Not Gibberish Text evaluates text coherence and meaningfulness. They can be used together for comprehensive content quality assessment.