Create Your Own Evaluations
Creating custom evaluations allows you to tailor assessment criteria to your specific use case and business requirements. Future AGI provides flexible tools to build evaluations that go beyond standard templates, enabling you to define custom rules, scoring mechanisms, and validation logic.
Why Create Custom Evaluations?
While Future AGI offers comprehensive evaluation templates, custom evaluations are essential when you need:
- Domain-Specific Validation: Assess content against industry-specific standards or regulations
- Business Rule Compliance: Ensure outputs meet your organization’s unique guidelines
- Complex Scoring Logic: Implement multi-criteria assessments with weighted scoring
- Custom Output Formats: Validate specific response structures or formats unique to your application
Creating Custom Evaluations
Using Web Interface
Step 1: Access Evaluation Creation
- Navigate to your dataset in the Future AGI platform
- Click on the Evaluate button in the top-right menu
- Click on Add Evaluation button
- Select Create your own eval
Step 2: Configure Basic Settings
Start by setting up the fundamental properties of your evaluation:
- Name: Enter a unique evaluation name (lowercase letters, numbers, and underscores only)
- Model Selection: Choose the appropriate model for your evaluation complexity:
  - Future AGI Models: Proprietary models optimized for evaluations
    - TURING_LARGE (`turing_large`): Flagship evaluation model that delivers best-in-class accuracy across multimodal inputs (text, images, audio). Recommended when maximal precision outweighs latency constraints.
    - TURING_SMALL (`turing_small`): Compact variant that preserves high evaluation fidelity while lowering computational cost. Supports text and image evaluations.
    - TURING_FLASH (`turing_flash`): Latency-optimised version of TURING, providing high-accuracy assessments for text and image inputs with fast response times.
    - PROTECT (`protect`): Real-time guardrailing model for safety, policy compliance, and content-risk detection. Offers very low latency on text and audio streams and permits user-defined rule sets.
    - PROTECT_FLASH (`protect_flash`): Ultra-fast binary guardrail for text content. Designed for first-pass filtering where millisecond-level turnaround is critical.
  - Other LLMs: Use external language models from providers like OpenAI, Anthropic, or your own custom models.
Step 3: Define Evaluation Rules
- Rule Prompt: Write the evaluation criteria and instructions
- Use `{{variable_name}}` syntax to create dynamic variables that will be mapped to dataset columns
- Be specific about what constitutes a pass/fail result or how the score should be assigned
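For example, a rule prompt that checks whether a generated summary stays faithful to its source might look like the sketch below; the variable names `{{source_text}}` and `{{summary}}` are illustrative and should match your own dataset columns:

```
Evaluate whether {{summary}} accurately reflects the key points of {{source_text}}.
Fail if the summary introduces facts that are not present in {{source_text}} or omits its main conclusion.
Pass only if the summary is both accurate and complete.
```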
Step 4: Configure Output Type
- Pass/Fail: Binary evaluation (1.0 for pass, 0.0 for fail)
- Percentage: Numerical score between 0 and 100
- Deterministic Choices: Select from predefined categorical options
Step 5: Additional Settings
- Tags: Add relevant tags for organization and filtering
- Description: Provide a clear description of the evaluation’s purpose
- Check Internet: Enable web access for real-time information validation
Example: Creating a Chatbot Evaluation
Let’s walk through creating a custom evaluation for a customer service chatbot. This example will show how to ensure the chatbot’s responses are both polite and effectively address user queries.
Step 1: Basic Configuration
- Name: `chatbot_politeness_and_relevance`
- Model Selection: TURING_SMALL (ideal for straightforward evaluations like this)
- Description: “Evaluates if the chatbot’s response is polite and relevant to the user’s query.”
Step 2: Define Evaluation Rules
Create a rule prompt that clearly specifies the evaluation criteria:
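A prompt along the following lines fits this use case (the exact wording is illustrative and can be adapted to your own guidelines):

```
Evaluate whether {{chatbot_response}} is polite and directly addresses {{user_query}}.
Pass (1.0) only if the response maintains a courteous, professional tone AND resolves or meaningfully answers the user's question.
Fail (0.0) if the response is rude, dismissive, or does not address the query.
```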
Step 3: Configure Output
- Output Type: Pass/Fail (1.0 for pass, 0.0 for fail)
- Tags: `customer-service`, `politeness`, `relevance`
Step 4: Map Variables
In your dataset, map the variables to their corresponding columns:
- `{{user_query}}` → Column containing user questions
- `{{chatbot_response}}` → Column containing chatbot responses
This evaluation will help ensure your chatbot maintains high standards of interaction by checking both the tone and relevance of responses. The pass/fail output makes it easy to quickly identify responses that need improvement.
Running the Evaluation
You can run the evaluation either through the web interface or using the SDK.
Using Web Interface
- Navigate to your dataset in the Future AGI platform
- Click on the Evaluate button in the top-right menu
- Click on the evaluation you just created
- Configure the columns that you want to use for the evaluation
- Click on the Add & Run button
Using SDK
After creating the evaluation, you can run it using the FutureAGI SDK.
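The snippet below is a minimal sketch of what an SDK run could look like. The import path, class, method, and parameter names (`fi.evals`, `Evaluator`, `evaluate`, `eval_templates`, `inputs`) are assumptions for illustration rather than the confirmed interface; check the SDK reference for the exact names, and reference the evaluation by the name you created above (`chatbot_politeness_and_relevance`).

```python
# Hypothetical sketch: import path, class, method, and parameter names are
# assumptions, not the confirmed SDK interface; verify against the SDK docs.
import os

from fi.evals import Evaluator  # assumed import path

# Credentials are read from environment variables (assumed variable names).
evaluator = Evaluator(
    fi_api_key=os.environ["FI_API_KEY"],
    fi_secret_key=os.environ["FI_SECRET_KEY"],
)

# Run the custom evaluation created above by name, mapping its template
# variables to concrete values for a single example.
result = evaluator.evaluate(
    eval_templates="chatbot_politeness_and_relevance",
    inputs={
        "user_query": "How do I reset my password?",
        "chatbot_response": "Happy to help! Go to Settings > Security and click 'Reset password'.",
    },
)

print(result)  # expect a pass/fail score (1.0 or 0.0), matching the configured output type
```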