Why Create Custom Evaluations?
While Future AGI offers comprehensive evaluation templates, custom evaluations are essential when you need:
- Domain-Specific Validation: Assess content against industry-specific standards or regulations
- Business Rule Compliance: Ensure outputs meet your organization’s unique guidelines
- Complex Scoring Logic: Implement multi-criteria assessments with weighted scoring
- Custom Output Formats: Validate specific response structures or formats unique to your application
Creating Custom Evaluations
Using Web Interface
Step 1: Access Evaluation Creation
- Navigate to your dataset in the Future AGI platform
- Click on the Evaluate button in the top-right menu
- Click on Add Evaluation button
- Select Create your own eval
Step 2: Configure Evaluation Settings
- Name: Enter a unique evaluation name (lowercase letters, numbers, and underscores only)
- Model Selection: Choose the appropriate model for your evaluation complexity:
  - Future AGI Models: Proprietary models optimized for evaluations
    - TURING_LARGE (turing_large): Flagship evaluation model that delivers best-in-class accuracy across multimodal inputs (text, images, audio). Recommended when maximal precision outweighs latency constraints.
    - TURING_SMALL (turing_small): Compact variant that preserves high evaluation fidelity while lowering computational cost. Supports text and image evaluations.
    - TURING_FLASH (turing_flash): Latency-optimized version of TURING, providing high-accuracy assessments for text and image inputs with fast response times.
    - PROTECT (protect): Real-time guardrailing model for safety, policy compliance, and content-risk detection. Offers very low latency on text and audio streams and permits user-defined rule sets.
    - PROTECT_FLASH (protect_flash): Ultra-fast binary guardrail for text content. Designed for first-pass filtering where millisecond-level turnaround is critical.
  - Other LLMs: Use external language models from providers like OpenAI, Anthropic, or your own custom models. Click here to learn how to add custom models.
- Rule Prompt: Write the evaluation criteria and instructions
  - Use {{variable_name}} syntax to create dynamic variables that will be mapped to dataset columns
  - Be specific about what constitutes a pass, a fail, or each score level (an illustrative prompt appears after this list)
- Output Type: Choose how results are reported:
  - Pass/Fail: Binary evaluation (1.0 for pass, 0.0 for fail)
  - Percentage: Numerical score between 0 and 100
  - Deterministic Choices: Select from predefined categorical options
- Tags: Add relevant tags for organization and filtering
- Description: Provide a clear description of the evaluation’s purpose
- Check Internet: Enable web access for real-time information validation
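For instance, a rule prompt for a percentage-scored evaluation might look like the sketch below. The variable names and criteria are purely illustrative, not a template prescribed by the platform:

```
Evaluate the product description against the provided style guide.

Product description: {{generated_description}}
Style guide: {{style_guide}}

Score from 0 to 100:
- Up to 40 points if every factual claim is supported by the style guide
- Up to 30 points if the tone matches the brand voice described in the style guide
- Up to 30 points if the description avoids the banned terms listed in the style guide

Return only the total score.
```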
Example: Creating a Chatbot Evaluation
Let’s walk through creating a custom evaluation for a customer service chatbot. This example shows how to verify that the chatbot’s responses are polite and that they effectively address the user’s query.
Step 1: Basic Configuration
- Name: chatbot_politeness_and_relevance
- Model Selection: TURING_SMALL (ideal for straightforward evaluations like this)
- Description: “Evaluates if the chatbot’s response is polite and relevant to the user’s query.”
Step 2: Define Evaluation Rules
Create a rule prompt that clearly specifies the evaluation criteria, for example:
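A minimal pass/fail prompt along these lines would work; the exact wording is illustrative:

```
You are evaluating a customer service chatbot.

User query: {{user_query}}
Chatbot response: {{chatbot_response}}

The response passes only if BOTH conditions hold:
1. Politeness: the response uses a courteous, professional tone with no dismissive or rude language.
2. Relevance: the response directly addresses the user's query rather than giving a generic or off-topic answer.

Return 1.0 if both conditions are met, otherwise return 0.0.
```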
Step 3: Configure Output
- Output Type: Pass/Fail (1.0 for pass, 0.0 for fail)
- Tags: customer-service, politeness, relevance
Step 4: Map Variables
In your dataset, map the variables to their corresponding columns:
- {{user_query}} → Column containing user questions
- {{chatbot_response}} → Column containing chatbot responses
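For illustration, a dataset with the following columns (hypothetical rows) maps directly onto these variables; the first row should pass, while the second should fail on politeness:

| user_query | chatbot_response |
| --- | --- |
| Where is my order #1042? | I'm sorry for the wait! Your order shipped yesterday and should arrive by Friday. |
| Cancel my subscription. | That's not my problem. |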
Running the Evaluation
You can run the evaluation either through the web interface or using the SDK.
Using Web Interface
- Navigate to your dataset in the Future AGI platform
- Click on the Evaluate button in the top-right menu
- Click on the evaluation you just created
- Configure the columns that you want to use for the evaluation
- Click on the Add & Run button
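To run the evaluation programmatically instead, a minimal Python sketch might look like the following. The import path, class, and method names (fi.evals, Evaluator, evaluate) and the credential parameters are assumptions for illustration only; consult the SDK reference for the exact API.

```python
# Minimal sketch of running the custom eval via the SDK.
# NOTE: the import path, class, and method names below are illustrative
# assumptions -- check the Future AGI SDK reference for the actual API.
from fi.evals import Evaluator  # hypothetical import path

# Authenticate with your Future AGI credentials (placeholders shown).
evaluator = Evaluator(
    fi_api_key="YOUR_API_KEY",
    fi_secret_key="YOUR_SECRET_KEY",
)

# Map the rule-prompt variables to the values from one dataset row.
inputs = {
    "user_query": "Where is my order #1042?",
    "chatbot_response": "I'm sorry for the wait! Your order shipped yesterday.",
}

# Run the custom eval by the name given when it was created; a Pass/Fail
# eval is expected to return 1.0 (pass) or 0.0 (fail).
result = evaluator.evaluate(
    eval_templates="chatbot_politeness_and_relevance",
    inputs=inputs,
)
print(result)
```

In practice you would iterate over (or batch) the dataset rows rather than scoring a single query/response pair.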