Content Moderation
Evaluates content safety using OpenAI's content moderation system to detect and flag potentially harmful, inappropriate, or unsafe content. This evaluation provides a binary (Pass/Fail) assessment of text content against established safety guidelines.
Evaluation Using Interface
Input:
- Required Inputs:
- text: The text content column to moderate.
Output:
- Score: Float score between 0 and 1
Interpretation:
- Higher scores (closer to 1): Indicate safer content, less likely to contain harmful, inappropriate, or unsafe material.
- Lower scores (closer to 0): Indicate potentially inappropriate content that may require review or filtering.
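The snippet below is a minimal sketch of how the 0-1 safety score might be mapped to a Pass/Fail outcome. The 0.5 threshold and function name are assumptions chosen for illustration, not documented defaults.

```python
# Illustrative only: mapping a 0-1 safety score to a Pass/Fail verdict.
# The 0.5 threshold is an assumed cut-off, not a documented default.
def to_pass_fail(score: float, threshold: float = 0.5) -> str:
    """Map a content moderation safety score to a binary verdict."""
    return "Pass" if score >= threshold else "Fail"

print(to_pass_fail(0.92))  # Pass: content judged safe
print(to_pass_fail(0.18))  # Fail: content likely needs review or filtering
```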
Evaluation Using Python SDK
Click here to learn how to set up evaluation using the Python SDK.
| Input Type | Parameter | Type | Description |
|---|---|---|---|
| Required Inputs | text | string | The text content to moderate. |
| Output | Type | Description |
|---|---|---|
| Score | float | Returns a score between 0 and 1. Higher values indicate safer content; lower values indicate potentially inappropriate content. |
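Because the evaluation is backed by OpenAI's content moderation system, the sketch below shows how the underlying check could be performed directly with the openai Python library. The conversion of per-category scores into a single 0-1 safety score is an assumption made for illustration; the evaluation SDK's actual interface may differ, so refer to the setup guide linked above.

```python
# Sketch of the underlying check using the openai Python library directly.
# Collapsing per-category scores into one 0-1 safety score is an assumption
# for illustration; the evaluation SDK's own interface may differ.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def safety_score(text: str) -> float:
    """Return a score between 0 and 1, where higher means safer content."""
    response = client.moderations.create(
        model="omni-moderation-latest",
        input=text,
    )
    result = response.results[0]
    # category_scores holds per-category violation probabilities;
    # take the worst category and invert it so that 1.0 means "safe".
    worst = max(result.category_scores.model_dump().values())
    return 1.0 - worst

print(safety_score("Thanks for your help with the report today!"))
```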
What to do when Content Moderation Fails
When content moderation fails, the first step is to analyse the flagged content by identifying which safety categories triggered the failure and reviewing the specific problematic sections. Understanding the context and severity of violations is crucial in determining the appropriate remediation steps.
To address flagged content, modifications may include rewording while preserving meaning, implementing pre-processing safety checks, or adding content filtering before submission. If system adjustments are required, reviewing and refining safety thresholds, implementing category-specific filters, and incorporating additional pre-screening measures can enhance moderation accuracy. For more robust filtering, a multi-stage moderation pipeline may be considered.
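As one possible shape for such a multi-stage pipeline, the sketch below runs a cheap local pre-screen before calling OpenAI's moderation endpoint. The blocklist pattern, threshold, and function names are assumptions chosen for the example.

```python
# Illustrative multi-stage pipeline: a cheap local pre-screen followed by
# OpenAI's moderation endpoint. The blocklist, threshold, and function
# names are assumptions chosen for this example.
import re
from openai import OpenAI

client = OpenAI()
# Stage 1 blocklist: terms the application never allows (placeholder pattern).
BLOCKLIST = re.compile(r"\b(forbidden_term_a|forbidden_term_b)\b", re.IGNORECASE)

def moderate(text: str, threshold: float = 0.5) -> str:
    # Stage 1: reject obviously disallowed patterns without an API call.
    if BLOCKLIST.search(text):
        return "Fail"
    # Stage 2: full moderation check against OpenAI's safety categories.
    result = client.moderations.create(
        model="omni-moderation-latest",
        input=text,
    ).results[0]
    if result.flagged:
        return "Fail"
    worst = max(result.category_scores.model_dump().values())
    return "Pass" if (1.0 - worst) >= threshold else "Fail"

print(moderate("Let's schedule the quarterly review for Friday."))
```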
Comparing Content Moderation with Similar Evals
- Safe for Work Text: While Content Moderation provides comprehensive safety analysis, Safe for Work Text specifically focuses on workplace appropriateness. Content Moderation is broader and includes multiple safety categories.
- Not Gibberish Text: Content Moderation focuses on safety aspects, while Not Gibberish Text evaluates text coherence and meaningfulness. They can be used together for comprehensive content quality assessment.