Definition

Prompt Injection evaluation detects attempts to manipulate or bypass the intended behaviour of language models through carefully crafted inputs. This evaluation is crucial for ensuring the security and reliability of AI systems by identifying potential security vulnerabilities in prompt handling.


Calculation

The evaluation process begins with pattern analysis, scanning for known injection patterns, command-like structures, or system instructions that may indicate an attempt to manipulate the model’s behaviour.

Next, context is examined to determine the relationship between different input components, identifying hidden instructions and detecting any context manipulation techniques.

Finally, it evaluates the potential impact of the input on system functionality, identifying attempts to access restricted features or bypass security measures.

Based on these assessments, the system returns a binary result: “Pass” if no injection attempts are detected and “Fail” if suspicious patterns are found.


What to do when Prompt Injection is Detected

If prompt injection attempt is detected, immediate actions should be taken to mitigate potential risks. This includes blocking or sanitising the suspicious input, logging the attempt for security analysis, and triggering appropriate security alerts.

To enhance system resilience, prompt injection detection patterns should be regularly updated, input validation rules should be strengthened, and additional security layers should be implemented.


Differentiating Prompt Injection with Toxicity

Prompt Injection focuses on detecting attempts to manipulate system behaviour through carefully crafted inputs designed to override or alter intended responses. In contrast, Toxicity evaluation identifies harmful or offensive language within the content, ensuring that AI-generated outputs remain appropriate and respectful.