Content Moderation
Definition
Evaluates content safety using OpenAI’s content moderation system to detect and flag potentially harmful, inappropriate, or unsafe content. This evaluation provides a binary (Pass/Fail) assessment of text content against established safety guidelines.
Calculation
The eval processes text input by submitting it to OpenAI’s moderation endpoint, where it undergoes analysis across multiple safety categories, including violence, sexual content, hate speech, harassment, self-harm, profanity, and illegal activities.
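A minimal sketch of how the endpoint’s response maps to the eval’s verdict. The dict below is a hand-written stand-in shaped like the moderation endpoint’s JSON (the `flagged`, `categories`, and `category_scores` fields exist in the real response; the sample values are illustrative). In practice the response would come from a call such as `client.moderations.create(...)` in the OpenAI Python SDK.

```python
# Hypothetical moderation response, shaped like the JSON returned by
# OpenAI's moderation endpoint (field names are real; values are made up).
sample_response = {
    "results": [
        {
            "flagged": True,
            "categories": {"violence": True, "hate": False, "self-harm": False},
            "category_scores": {"violence": 0.91, "hate": 0.02, "self-harm": 0.01},
        }
    ]
}

def verdict(response: dict) -> str:
    """Map a moderation response to the eval's binary Pass/Fail output."""
    result = response["results"][0]
    return "Fail" if result["flagged"] else "Pass"

print(verdict(sample_response))  # -> Fail
```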
The evaluation follows a three-stage process. First, an Initial Scan tokenises the text, assesses each token against safety categories, and assigns confidence scores. Next, in the Category Analysis phase, content is systematically checked against predefined safety categories, severity levels are determined, and thresholds are applied to identify potential violations. Finally, the Final Assessment aggregates results across categories, applies a binary decision based on established thresholds, and generates an output of either Pass or Fail.
The scoring logic follows a binary approach: content either meets all safety guidelines and passes, or it violates one or more safety categories and fails, with no partial scores assigned.
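The category analysis and binary aggregation above can be sketched as follows. The threshold values and category names here are illustrative assumptions, not OpenAI’s actual internal thresholds:

```python
from typing import Dict

# Illustrative per-category thresholds (hypothetical values, not OpenAI's).
THRESHOLDS: Dict[str, float] = {
    "violence": 0.5,
    "sexual": 0.5,
    "hate": 0.4,
    "harassment": 0.4,
    "self-harm": 0.3,
}

def flag_categories(scores: Dict[str, float]) -> Dict[str, bool]:
    """Category analysis: apply each category's threshold to its confidence score."""
    return {cat: scores.get(cat, 0.0) >= thr for cat, thr in THRESHOLDS.items()}

def final_assessment(scores: Dict[str, float]) -> str:
    """Final assessment: a single flagged category fails the whole text."""
    return "Fail" if any(flag_categories(scores).values()) else "Pass"

print(final_assessment({"violence": 0.05, "hate": 0.1}))  # -> Pass
print(final_assessment({"harassment": 0.72}))             # -> Fail
```

Note how the aggregation is all-or-nothing: there is no weighting or averaging across categories, which matches the binary Pass/Fail output described above.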
What to do when Content Moderation Fails
When content moderation fails, the first step is to analyse the flagged content by identifying which safety categories triggered the failure and reviewing the specific problematic sections. Understanding the context and severity of violations is crucial in determining the appropriate remediation steps.
To address flagged content, modifications may include rewording while preserving meaning, implementing pre-processing safety checks, or adding content filtering before submission. If system adjustments are required, reviewing and refining safety thresholds, implementing category-specific filters, and incorporating additional pre-screening measures can enhance moderation accuracy. For more robust filtering, consider a multi-stage moderation pipeline.
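One way such a multi-stage pipeline might look is sketched below: cheap local pre-screens run first, and the (comparatively expensive) moderation-endpoint call would only run if they all pass. The stage functions and blocklist terms are hypothetical placeholders, not part of any real API:

```python
import re
from typing import Callable, List, Tuple

# Each stage returns (passed, reason); the pipeline short-circuits on failure.
Stage = Callable[[str], Tuple[bool, str]]

def length_stage(text: str) -> Tuple[bool, str]:
    """Pre-screen: reject empty or whitespace-only input before any API call."""
    return (bool(text.strip()), "ok" if text.strip() else "empty input")

def blocklist_stage(text: str) -> Tuple[bool, str]:
    """Pre-screen: reject text containing known-disallowed terms (placeholders)."""
    blocked = {"badwordone", "badwordtwo"}  # hypothetical blocklist
    words = set(re.findall(r"[a-z']+", text.lower()))
    hit = words & blocked
    return (not hit, f"blocklist: {sorted(hit)}" if hit else "ok")

def run_pipeline(text: str, stages: List[Stage]) -> Tuple[str, str]:
    """Run stages in order; the first failing stage fails the whole text."""
    for stage in stages:
        passed, reason = stage(text)
        if not passed:
            return "Fail", reason
    # A final stage would submit the text to the moderation endpoint here.
    return "Pass", "all stages passed"

print(run_pipeline("hello world", [length_stage, blocklist_stage]))
```

Ordering stages from cheapest to most expensive keeps latency and API costs down, since clearly unsafe or malformed input never reaches the endpoint.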
Comparing Content Moderation with Similar Evals
- Safe for Work Text: While Content Moderation provides comprehensive safety analysis, Safe for Work Text specifically focuses on workplace appropriateness. Content Moderation is broader and includes multiple safety categories.
- Not Gibberish Text: Content Moderation focuses on safety aspects, while Not Gibberish Text evaluates text coherence and meaningfulness. They can be used together for comprehensive content quality assessment.