This guide will walk you through setting up an evaluation in Future AGI, allowing you to assess AI models and workflows efficiently. You can run evaluations via the Future AGI platform or using the Python SDK.
Evaluate Using SDK
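A minimal sketch of the SDK flow, based only on the names this guide mentions (the Evaluator class, the evaluate function, and the eval_templates, is_async, and get_eval_result parameters/methods); the input field names, the template name, and the exact call signatures below are illustrative assumptions, so check the SDK reference for the real ones:

```python
import os

# Assumption: credentials are read from these environment variables, as
# described below, instead of being passed to Evaluator as parameters.
os.environ.setdefault("fi_api_key", "<your-api-key>")
os.environ.setdefault("fi_secret_key", "<your-secret-key>")

def build_inputs(rows):
    """Shape raw rows into the list-of-dicts payload passed to evaluate().
    The field names "input"/"output" are assumptions; match them to the
    keys your chosen eval template expects."""
    return [{"input": r["question"], "output": r["answer"]} for r in rows]

def run_eval():
    # Imported lazily so this sketch can be read without the SDK installed.
    from fi.evals import Evaluator  # class name taken from this guide

    # Picks up fi_api_key / fi_secret_key from the environment.
    evaluator = Evaluator()
    inputs = build_inputs([{"question": "What is 2 + 2?", "answer": "4"}])

    # Synchronous run: evaluate() is called with the eval_templates
    # parameter. "Groundedness" is a hypothetical template name.
    result = evaluator.evaluate(eval_templates=["Groundedness"], inputs=inputs)
    print(result)

    # Asynchronous run: set is_async=True, then fetch the outcome later
    # with get_eval_result (argument shape is an assumption).
    task = evaluator.evaluate(
        eval_templates=["Groundedness"], inputs=inputs, is_async=True
    )
    print(evaluator.get_eval_result(task))

# Call run_eval() to execute against the live service with real credentials.
```

The helper keeps payload construction separate from the API calls, so you can unit-test the shaping logic without credentials.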
You can set the fi_api_key and fi_secret_key environment variables before using the Evaluator class, instead of passing them as parameters. To run evaluations asynchronously, set the is_async parameter to True and retrieve the results later with get_eval_result. To run an evaluation, call the evaluate function with the eval_templates parameter.

Evaluate Using UI
turing_large: Flagship evaluation model that delivers best-in-class accuracy across multimodal inputs (text, images, audio). Recommended when maximal precision outweighs latency constraints.
turing_small: Compact variant that preserves high evaluation fidelity at lower computational cost. Supports text and image evaluations.
turing_flash: Latency-optimised version of TURING, providing high-accuracy assessments for text and image inputs with fast response times.
protect: Real-time guardrailing model for safety, policy compliance, and content-risk detection. Offers very low latency on text and audio streams and supports user-defined rule sets.
protect_flash: Ultra-fast binary guardrail for text content. Designed for first-pass filtering where millisecond-level turnaround is critical.
Use {{}} to create a key (variable); that variable is used later when you configure the evaluation.
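For example, a UI prompt template might look like the following, where context and response are illustrative key names you would map to your data when configuring the evaluation:

```text
Check that {{response}} only uses facts present in {{context}}.
```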