Evaluate via CI/CD Pipeline

Run Future AGI evaluations in your CI/CD pipeline to assess model performance on every pull request and keep quality checks consistent before deployment.

About

CI/CD evaluation brings quality checks into your existing development workflow. Every time code changes, your eval suite runs automatically, scores your AI outputs against the templates you define, and tracks results by version.

This catches regressions before they ship and gives your team a versioned history of how AI quality changes over time. You can compare any two versions side by side to see exactly where things improved or dropped.


When to use

  • Gate PRs on quality: Run evals on every PR so regressions in tone, factual consistency, or custom metrics block or flag merges before they land.
  • Compare versions in CI: Submit evaluations with a version tag and compare results across versions in one place.
  • Automate quality reporting: Post eval results as a PR comment so reviewers see model performance without leaving GitHub.
  • Repeatable checks: Use the same eval templates and inputs in CI so every run is directly comparable.

Prerequisites

  • A Future AGI account with API key and secret key
  • A CI system that can run Python (GitHub Actions, GitLab CI, Jenkins, or any runner with Python and network access)
  • The ai-evaluation package (pip install "ai-evaluation>=0.1.7" — quote the spec so the shell does not treat >= as a redirect)

Required packages

Add these to a requirements.txt at the repository root; the workflow below installs them with pip install -r requirements.txt.

pandas
requests
tabulate
ai-evaluation>=0.1.7
python-dotenv

Required secrets

Set these as environment variables or in your CI’s secret store. Do not commit them.

| Secret | Description |
| --- | --- |
| `FI_API_KEY` | Your Future AGI API key |
| `FI_SECRET_KEY` | Your Future AGI secret key |
| `PAT_GITHUB` | Personal Access Token for repository access (GitHub Actions only) |
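For local runs, the same two Future AGI keys can live in a `.env` file, which the evaluation script loads via python-dotenv (the values below are placeholders):

```shell
# .env (keep out of version control; add to .gitignore)
FI_API_KEY=your_future_agi_api_key
FI_SECRET_KEY=your_future_agi_secret_key
```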

Required variables

| Variable | Description | Default |
| --- | --- | --- |
| `PROJECT_NAME` | Future AGI project name | `Voice Agent` |
| `VERSION` | Current version identifier | `v0.1.0` |
| `COMPARISON_VERSIONS` | Comma-separated versions to compare against | (empty) |

Core SDK Functions

The pipeline uses two SDK functions: evaluate_pipeline to submit an eval run tagged to a version, and get_pipeline_results to retrieve and compare results across versions.

Initialize the Evaluator

import os

from fi.evals import Evaluator

evaluator = Evaluator(
    fi_api_key=os.getenv("FI_API_KEY"),
    fi_secret_key=os.getenv("FI_SECRET_KEY"),
)

Define Evaluation Data

Structure a list of eval configs. Each has an eval_template, model_name, and inputs (keys mapped to lists of values). For more on templates and inputs, see Running your first eval.

eval_data = [
    {
        "eval_template": "tone",
        "model_name": "turing_large",
        "inputs": {
            "input": [
                "This product is amazing!",
                "I am very disappointed with the service."
            ]
        }
    },
    {
        "eval_template": "groundedness",
        "model_name": "turing_large",
        "inputs": {
            "input": [
                "What is the capital of France?",
                "Who wrote Hamlet?"
            ],
            "context": [
                "What is the capital of France?",
                "Who wrote Hamlet?"
            ],
            "output": [
                "The capital of France is Paris.",
                "William Shakespeare wrote Hamlet."
            ]
        }
    }
]
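Each key under inputs is a parallel column, so every list within one config must have the same length. A small sanity check can catch mismatches before submitting (the helper name is my own, not part of the SDK):

```python
def check_eval_data(eval_data):
    """Raise if any config's input lists have differing row counts."""
    for cfg in eval_data:
        lengths = {key: len(rows) for key, rows in cfg["inputs"].items()}
        if len(set(lengths.values())) > 1:
            raise ValueError(
                f"{cfg['eval_template']}: mismatched row counts {lengths}"
            )

# Passes: a single column with one row
check_eval_data([
    {"eval_template": "tone", "model_name": "turing_large",
     "inputs": {"input": ["This product is amazing!"]}},
])
```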

Submit Evaluation Pipeline

result = evaluator.evaluate_pipeline(
    project_name="my-project",
    version="v0.1.5",
    eval_data=eval_data,
)

| Parameter | Description |
| --- | --- |
| `project_name` | Your project identifier |
| `version` | Version tag for this run (e.g. branch name or commit SHA) |
| `eval_data` | List of evaluation configurations (template, model, inputs) |
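In GitHub Actions, the runner sets GITHUB_SHA automatically, so the version tag can be derived rather than hardcoded. A minimal sketch (the helper name is my own):

```python
import os

def ci_version(default="v0.1.0"):
    """Use the short commit SHA when running in CI; fall back to a default locally."""
    sha = os.getenv("GITHUB_SHA")
    return sha[:7] if sha else default
```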

Retrieve Results

result = evaluator.get_pipeline_results(
    project_name="my-project",
    versions=["v0.1.0", "v0.1.1", "v0.1.5"],
)

| Parameter | Description |
| --- | --- |
| `project_name` | Your project identifier |
| `versions` | List of version tags to retrieve results for |
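The response nests per-version runs under result.evaluation_runs, the same shape the script below polls for. A small helper (the name and sample values are my own, for illustration) to index summaries by version:

```python
def summaries_by_version(result):
    """Map each version tag to its results_summary from a get_pipeline_results response."""
    runs = result.get("result", {}).get("evaluation_runs", [])
    return {run.get("version"): run.get("results_summary", {}) for run in runs}

# Illustrative response shape with sample values
sample = {
    "status": True,
    "result": {
        "status": "completed",
        "evaluation_runs": [
            {"version": "v0.1.0", "results_summary": {"tone": 0.8}},
            {"version": "v0.1.5", "results_summary": {"tone": 0.9}},
        ],
    },
}
```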

Full GitHub Actions Implementation

Workflow File

Create .github/workflows/evaluation.yml:

name: Run Evaluation on PR

on:
  pull_request:
    branches:
      - main

jobs:
  evaluate:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
    steps:
      - name: Check out repository code
        uses: actions/checkout@v4
        with:
          token: ${{ secrets.PAT_GITHUB }}

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run evaluation script
        run: python evaluate_pipeline.py
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          PR_NUMBER: ${{ github.event.number }}
          REPO_NAME: ${{ github.repository }}
          FI_API_KEY: ${{ secrets.FI_API_KEY }}
          FI_SECRET_KEY: ${{ secrets.FI_SECRET_KEY }}
          PROJECT_NAME: ${{ vars.PROJECT_NAME || 'Voice Agent' }}
          VERSION: ${{ vars.VERSION || 'v0.1.0' }}
          COMPARISON_VERSIONS: ${{ vars.COMPARISON_VERSIONS || '' }}

Note

Critical: You must specify pull-requests: write in your workflow permissions. Without this, the action cannot post comments on your PR.

Evaluation Script

Create evaluate_pipeline.py:

from dotenv import load_dotenv
load_dotenv()

import os
import json
import time
import requests
import pandas as pd
from fi.evals import Evaluator

# Define your evaluation data - CUSTOMIZE THIS SECTION
eval_data = [
    {
        "eval_template": "tone",
        "model_name": "turing_large",
        "inputs": {
            "input": [
                "This product is amazing!",
                "I am very disappointed with the service."
            ]
        }
    },
    {
        "eval_template": "groundedness",
        "model_name": "turing_large",
        "inputs": {
            "input": [
                "What is the capital of France?",
                "Who wrote Hamlet?"
            ],
            "context": [
                "What is the capital of France?",
                "Who wrote Hamlet?"
            ],
            "output": [
                "The capital of France is Paris.",
                "William Shakespeare wrote Hamlet."
            ]
        }
    }
]

def post_github_comment(content):
    """Posts a comment to a GitHub pull request."""
    repo = os.getenv("REPO_NAME")
    pr_number = os.getenv("PR_NUMBER")
    token = os.getenv("GITHUB_TOKEN")

    if not all([repo, pr_number, token]):
        print("Missing GitHub details. Skipping comment.")
        return

    url = f"https://api.github.com/repos/{repo}/issues/{pr_number}/comments"
    headers = {
        "Authorization": f"token {token}",
        "Accept": "application/vnd.github.v3+json",
    }
    data = {"body": content}

    response = requests.post(url, headers=headers, data=json.dumps(data))

    if response.status_code == 201:
        print("Successfully posted comment to PR.")
    else:
        print(f"Failed to post comment. Status code: {response.status_code}")

def poll_for_completion(evaluator, project_name, current_version,
                        comparison_versions_str="", max_wait_time=600,
                        poll_interval=30):
    """Polls for evaluation completion by fetching all versions."""
    start_time = time.time()

    comparison_versions = []
    if comparison_versions_str:
        comparison_versions = [v.strip() for v in comparison_versions_str.split(',') if v.strip()]

    all_versions = list(set([current_version] + comparison_versions))

    while time.time() - start_time < max_wait_time:
        elapsed_time = int(time.time() - start_time)
        print(f"Polling for results (elapsed: {elapsed_time}s/{max_wait_time}s)...")

        try:
            result = evaluator.get_pipeline_results(
                project_name=project_name,
                versions=all_versions
            )

            if result.get('status'):
                api_result = result.get('result', {})
                status = api_result.get('status', 'unknown')
                evaluation_runs = api_result.get('evaluation_runs', [])

                if status == 'completed':
                    print("All requested versions are complete.")
                    return evaluation_runs
                elif status in ['failed', 'error', 'cancelled']:
                    print(f"Evaluation failed with status: {status}")
                    return None
        except Exception as e:
            print(f"Error polling for results: {e}")

        time.sleep(poll_interval)

    print(f"Timeout after {max_wait_time} seconds")
    return None

def format_results(evaluation_runs, current_version):
    """Formats results into a markdown comparison table."""
    if not evaluation_runs:
        return "No evaluation results found."

    def flatten(summary):
        """Flatten one level of nesting so metric names match the lookup keys below."""
        flat = {}
        for key, value in summary.items():
            if isinstance(value, dict):
                for sub_key, sub_value in value.items():
                    flat[f"{key}_{sub_key}"] = sub_value
            else:
                flat[key] = value
        return flat

    version_data = {run.get('version'): flatten(run.get('results_summary', {}))
                    for run in evaluation_runs}

    # Collect all metric names across versions
    all_metrics = set()
    for results in version_data.values():
        all_metrics.update(results.keys())

    comparison_data = []
    for metric in sorted(all_metrics):
        row = {'Metric': metric.replace('_', ' ').title()}
        for version in sorted(version_data.keys()):
            results = version_data[version]
            value = results.get(metric, 'N/A')
            if isinstance(value, float):
                formatted = f"{value:.2f}".rstrip('0').rstrip('.')
            else:
                formatted = str(value)
            label = f"{version} (current)" if version == current_version else version
            row[label] = formatted
        comparison_data.append(row)

    df = pd.DataFrame(comparison_data)
    return f"**Current Version:** {current_version}\n\n### Metrics Comparison\n\n{df.to_markdown(index=False)}\n"

def main():
    project_name = os.getenv("PROJECT_NAME", "Voice Agent")
    version = os.getenv("VERSION", "v0.1.0")
    comparison_versions = os.getenv("COMPARISON_VERSIONS", "")

    try:
        evaluator = Evaluator(
            fi_api_key=os.getenv("FI_API_KEY"),
            fi_secret_key=os.getenv("FI_SECRET_KEY")
        )
    except Exception as e:
        post_github_comment(f"## Evaluation Failed\n\n**Reason:** Failed to initialize evaluator: {e}")
        return

    try:
        result = evaluator.evaluate_pipeline(
            project_name=project_name,
            version=version,
            eval_data=eval_data
        )
        if not result.get('status'):
            post_github_comment(f"## Evaluation Failed\n\n**Reason:** {result}")
            return
    except Exception as e:
        post_github_comment(f"## Evaluation Failed\n\n**Reason:** Error submitting evaluation: {e}")
        return

    all_runs = poll_for_completion(evaluator, project_name, version, comparison_versions)

    if not all_runs:
        post_github_comment("## Evaluation Failed\n\n**Reason:** Timed out or failed during processing")
        return

    comment_body = format_results(all_runs, version)
    post_github_comment(comment_body)

if __name__ == "__main__":
    main()

Expected Output

The workflow posts a comment on your PR with the current version identifier and a metrics comparison table across versions.

(Figure: Evaluation CI/CD Pipeline)

Troubleshooting

| Issue | Solution |
| --- | --- |
| GitHub API errors when posting comments | Ensure `pull-requests: write` permission is set in the workflow. Verify `PAT_GITHUB` has repository access. |
| Evaluation fails to submit | Check that `FI_API_KEY` and `FI_SECRET_KEY` are correctly configured in GitHub secrets. |
| Timeout waiting for results | Increase `max_wait_time` in `poll_for_completion` for complex evaluations. Check network connectivity. |
| Wrong or missing metrics | Verify eval data format matches your templates. Check template names are correct. |
