Definition

Custom Code Eval allows the execution of custom Python code to assess specific evaluation criteria. It is highly flexible, letting users define their own logic for determining whether a given task passes or fails. It is particularly useful when standard evaluation methods do not suffice and custom logic is needed to meet unique requirements.

A Passed result indicates that the custom code executed successfully and met all predefined conditions, while a Failed result signifies that the code did not meet the expected criteria or encountered errors during execution.
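As an illustration, a custom eval might look like the sketch below. It assumes the code defines a `main` function that receives prompt run fields as keyword arguments and returns `True` for Passed and `False` for Failed. The field names `output` and `expected_keywords` are hypothetical, and the exact signature and return contract your platform expects may differ.

```python
def main(**kwargs):
    """Hypothetical custom eval: pass if the output mentions every expected keyword.

    Assumes the fields `output` (str) and `expected_keywords` (list of str)
    arrive as keyword arguments; these names are illustrative only.
    """
    output = kwargs.get("output", "")
    expected = kwargs.get("expected_keywords", [])
    text = output.lower()
    # Passed only if every keyword appears (case-insensitively) in the output.
    return all(keyword.lower() in text for keyword in expected)


if __name__ == "__main__":
    print(main(output="Paris is the capital of France.", expected_keywords=["Paris", "France"]))
    print(main(output="Berlin is in Germany.", expected_keywords=["Paris"]))
```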


Calculation

The eval first validates the provided code, checking that the main function is correctly structured and that the code is free of syntax errors. Once the code passes validation, the system sets up a secure execution environment, loads the required dependencies, and gives the code controlled access to prompt run fields via kwargs.
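The validation step can be approximated with Python's standard `ast` module, which parses code without running it. This is a minimal sketch under the assumption that the eval requires a top-level function named `main` accepting `**kwargs`; `validate_code` is a hypothetical helper, not the platform's actual validator.

```python
import ast
from typing import List


def validate_code(source: str, entry_point: str = "main") -> List[str]:
    """Return a list of problems found in `source`; an empty list means it passed.

    Sketch only: assumes a top-level function named `entry_point`
    that accepts **kwargs is required.
    """
    try:
        tree = ast.parse(source)  # parses only; nothing is executed
    except SyntaxError as exc:
        return [f"Syntax error on line {exc.lineno}: {exc.msg}"]

    functions = [
        node for node in tree.body
        if isinstance(node, ast.FunctionDef) and node.name == entry_point
    ]
    if not functions:
        return [f"No top-level function named '{entry_point}' was found"]

    problems = []
    # `args.kwarg` is None when the function has no **kwargs parameter.
    if functions[0].args.kwarg is None:
        problems.append(f"'{entry_point}' should accept **kwargs")
    return problems
```

For example, `validate_code("def main(**kwargs):\n    return True\n")` returns an empty list, while code missing `main` returns a single descriptive message.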

The system then runs the custom script within this controlled environment, handling any runtime errors or exceptions and capturing the execution output.
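The execution step can be sketched as follows: define the code in a fresh namespace, call its entry point with the prompt run fields as kwargs, and capture the return value, printed output, and any exception. The helper name `run_custom_code` and the result dictionary shape are assumptions for illustration. Note that plain `exec()` is not a secure sandbox; a real system would isolate execution in a separate process or container with resource limits.

```python
import contextlib
import io
import traceback
from typing import Any, Dict


def run_custom_code(source: str, fields: Dict[str, Any], entry_point: str = "main") -> Dict[str, Any]:
    """Execute `source`, call its entry point with `fields` as kwargs, and capture the outcome.

    Sketch only: exec() here is NOT sandboxed and must not be used on untrusted code.
    """
    namespace: Dict[str, Any] = {}
    stdout = io.StringIO()
    try:
        with contextlib.redirect_stdout(stdout):
            exec(source, namespace)  # defines the entry point inside `namespace`
            result = namespace[entry_point](**fields)
        return {"result": result, "stdout": stdout.getvalue(), "error": None}
    except Exception:
        # Runtime errors are recorded instead of crashing the evaluator.
        return {"result": None, "stdout": stdout.getvalue(), "error": traceback.format_exc()}
```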

Finally, the eval processes the execution output, applies the predefined success or failure criteria, and determines the final pass/fail status.
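The final decision can be sketched as a small mapping from the captured result and error to a status. The criteria below are assumptions for illustration, not the platform's documented rules: any execution error fails, and otherwise the result must be exactly `True`. `decide_status` is a hypothetical helper.

```python
from typing import Any, Optional


def decide_status(result: Any, error: Optional[str]) -> str:
    """Map captured execution output to "Passed" or "Failed".

    Assumed criteria (illustrative): any execution error fails;
    otherwise the result must be exactly the boolean True.
    """
    if error is not None:
        return "Failed"
    # Require a real True so values like 1 or "yes" do not pass by accident.
    return "Passed" if result is True else "Failed"
```

Requiring `result is True` rather than mere truthiness is a deliberate design choice in this sketch: it makes an accidental return value, such as a non-empty string, show up as a failure instead of a silent pass.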


What to Do When Custom Code Eval Fails

Start with a code review: check for syntax errors, verify that the function is correctly implemented, and ensure all required dependencies are available. Then validate the inputs: confirm that all necessary arguments are properly accessed and that input data types and formats match what the code expects.
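One way to make input problems easy to diagnose is to check the kwargs explicitly at the top of the function and raise descriptive errors. This is a minimal sketch; the field names `output` and `max_words` are hypothetical, and the word-limit check simply stands in for whatever logic your eval performs.

```python
def main(**kwargs):
    """Hypothetical custom eval with explicit input checks.

    The field names `output` and `max_words` are illustrative assumptions.
    A clear error message is easier to diagnose than a bare KeyError.
    """
    missing = [name for name in ("output", "max_words") if name not in kwargs]
    if missing:
        raise ValueError(f"Missing required fields: {', '.join(missing)}")

    output, max_words = kwargs["output"], kwargs["max_words"]
    if not isinstance(output, str):
        raise TypeError(f"'output' must be str, got {type(output).__name__}")
    if not isinstance(max_words, int):
        raise TypeError(f"'max_words' must be int, got {type(max_words).__name__}")

    # Pass if the output stays within the word limit.
    return len(output.split()) <= max_words
```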


Differentiating Custom Code Eval from Deterministic Eval

Deterministic Evals and Custom Code Eval share flexibility and customisation capabilities, allowing for tailored evaluation logic. Both can be configured for different types of outputs, with Deterministic Evals utilising rule prompts to guide evaluations.

However, Custom Code Eval executes actual Python code, enabling dynamic computations and logic, while Deterministic Evals rely on structured, rule-based evaluation methods.