Evaluate via CI/CD Pipeline

Run Future AGI evaluations in your CI/CD pipeline to assess model performance on every pull request and keep quality checks consistent before deployment.

What it is

Evaluate via CI/CD is an automated evaluation workflow that integrates Future AGI evaluations into your CI/CD pipeline. It enables consistent quality gating on every pull request, version-based metric comparison across runs, and automated result reporting — driven by two SDK functions: evaluate_pipeline for submitting evals and get_pipeline_results for retrieving and comparing them.


Use cases

  • Gate PRs on quality — Run evals on every PR so regressions in tone, factual consistency, or custom metrics block or flag merges.
  • Compare versions in CI — Submit evaluations with a version tag and compare results across versions (e.g. current branch vs main) in one place.
  • Automate quality reporting — Post eval results (e.g. metrics comparison table) as a PR comment so reviewers see model performance without leaving GitHub.
  • Repeatable checks — Use the same eval templates and inputs in CI so every run is comparable.

How to

The pipeline uses two main SDK functions. Set up the evaluator and your eval data, then submit and retrieve results.

Initialise the evaluator

Create an evaluator instance with your API credentials. Prefer environment variables for secrets.

import os

from fi.evals import Evaluator

evaluator = Evaluator(
    fi_api_key=os.getenv("FI_API_KEY"),
    fi_secret_key=os.getenv("FI_SECRET_KEY"),
)

Define your evaluation data

Structure a list of eval configs, each with eval_template, model_name, and inputs (a mapping from input key to a list of values, aligned by position). For more on templates and inputs, see Running your first eval.

eval_data = [
    {
        "eval_template": "tone",
        "model_name": "turing_large",
        "inputs": {
            "input": [
                "This product is amazing!",
                "I am very disappointed with the service."
            ]
        }
    },
    {
        "eval_template": "groundedness",
        "model_name": "turing_large",
        "inputs": {
            "input": [
                "What is the capital of France?",
                "Who wrote Hamlet?"
            ],
            "context": [
                "What is the capital of France?",
                "Who wrote Hamlet?"
            ],
            "output": [
                "The capital of France is Paris.",
                "William Shakespeare wrote Hamlet."
            ]
        }
    }
]
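The value lists inside inputs are aligned by position: the first entry of each list forms one eval case, the second entry the next, and so on, so every list within a config should have the same length. A minimal sketch of a pre-submit sanity check (the validate_eval_data helper is our own, not part of the SDK):

```python
def validate_eval_data(eval_data):
    """Raise if any config's input lists have mismatched lengths."""
    for config in eval_data:
        lengths = {key: len(values) for key, values in config["inputs"].items()}
        if len(set(lengths.values())) > 1:
            raise ValueError(
                f"Config {config['eval_template']!r} has mismatched "
                f"input lengths: {lengths}"
            )
```

Calling validate_eval_data(eval_data) just before submitting catches alignment mistakes locally instead of in a failed CI run.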

Submit the evaluation pipeline

Submit your eval data with a project name and version tag.

result = evaluator.evaluate_pipeline(
    project_name="your-project",
    version="v0.1.5",
    eval_data=eval_data,
)

Parameters:

  • project_name — Your project identifier.
  • version — Version tag for this run (e.g. branch or commit).
  • eval_data — List of evaluation configurations (template, model, inputs).

Retrieve results

Get results for one or more versions to compare.

result = evaluator.get_pipeline_results(
    project_name="your-project",
    versions=["v0.1.0", "v0.1.1", "v0.1.5"],
)

Parameters:

  • project_name — Your project identifier.
  • versions — List of version tags to fetch results for.
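The version strings here are whatever you passed to evaluate_pipeline, so pick something your CI can reproduce, such as a branch name or short commit SHA. A hedged sketch deriving one from GitHub Actions' built-in GITHUB_REF_NAME and GITHUB_SHA variables (the ci_version helper and the "local" fallback are our own conventions):

```python
import os

def ci_version():
    """Derive a version tag from GitHub Actions env vars, if present."""
    branch = os.getenv("GITHUB_REF_NAME")
    sha = os.getenv("GITHUB_SHA", "")[:7]
    if branch and sha:
        return f"{branch}-{sha}"
    return "local"
```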

How to add to your project

Add an evaluation step to your existing CI/CD pipeline so evals run on every build or pull request.

Prerequisites

  • A Future AGI account with API key and secret key.
  • A CI system that can run a Python script (e.g. GitHub Actions, GitLab CI, Jenkins, or any runner with Python and network access).

What you need in your project

  • Credentials — Set FI_API_KEY and FI_SECRET_KEY as environment variables (or in your CI’s secret store). Do not commit them.
  • Project and version — Decide how you identify the run: e.g. PROJECT_NAME (your Future AGI project) and VERSION (branch name, tag, or commit). Pass these into your script.
  • Eval data — The list of eval configs (template, model, inputs) you want to run each time. Define this in your script or load it from config.
  • Script — A small script that initialises the Evaluator, calls evaluate_pipeline, polls get_pipeline_results until the run completes (or times out), then formats the results. Optionally, post the results (e.g. as a PR comment or job summary) using your CI’s API. The script needs the ai-evaluation package; posting results may additionally need an HTTP client and the right tokens/permissions for your CI.
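Putting those pieces together, a skeleton for such a script might look like the following. This is a sketch, not a definitive implementation: the is_done completion check is a placeholder because the exact shape of the get_pipeline_results response is not documented here, and the SDK calls only run when credentials are present in the environment.

```python
import os
import sys
import time


def poll(fetch, is_done, timeout_s=300, interval_s=10):
    """Call fetch() until is_done(result) is true or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        result = fetch()
        if is_done(result):
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("evaluation did not complete in time")
        time.sleep(interval_s)


def main():
    from fi.evals import Evaluator  # requires the ai-evaluation package

    project = os.environ["PROJECT_NAME"]
    version = os.environ["VERSION"]
    evaluator = Evaluator(
        fi_api_key=os.environ["FI_API_KEY"],
        fi_secret_key=os.environ["FI_SECRET_KEY"],
    )
    eval_data = [...]  # your eval configs, as in the examples above
    evaluator.evaluate_pipeline(
        project_name=project, version=version, eval_data=eval_data
    )
    results = poll(
        fetch=lambda: evaluator.get_pipeline_results(
            project_name=project, versions=[version]
        ),
        # Placeholder completion check -- adapt to the real response shape.
        is_done=lambda r: bool(r),
    )
    print(results)
    # Exit non-zero here to fail the build if your pass criteria are not met.
    sys.exit(0)


if __name__ == "__main__" and os.getenv("FI_API_KEY"):
    main()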

Add the step to your pipeline

  1. In your pipeline config, add a job or step that runs on the events you want (e.g. every PR, or on merge to main).
  2. In that job: set the environment variables above, install dependencies (including ai-evaluation), then run your evaluation script.
  3. Use the script’s exit code or output to fail the build when evals fail, or to attach the comparison table to the run (e.g. PR comment, artifact, or log).

Tip

If you use GitHub Actions and want to post results as a PR comment, grant pull-requests: write in the job so the action can create comments.
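A hedged sketch of posting that comment from the script using only the standard library. The GitHub REST endpoint shown (POST /repos/{repo}/issues/{number}/comments) is GitHub's documented way to comment on a pull request; the helper names and the PR_NUMBER variable are our own conventions, and the workflow must supply GITHUB_TOKEN and GITHUB_REPOSITORY.

```python
import json
import os
import urllib.request


def build_comment_request(repo, pr_number, body, token):
    """Build a POST request for GitHub's PR (issue) comments endpoint."""
    url = f"https://api.github.com/repos/{repo}/issues/{pr_number}/comments"
    data = json.dumps({"body": body}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )


def post_results_comment(body):
    repo = os.environ["GITHUB_REPOSITORY"]  # e.g. "owner/repo"
    pr_number = os.environ["PR_NUMBER"]     # pass in from the workflow
    req = build_comment_request(repo, pr_number, body, os.environ["GITHUB_TOKEN"])
    with urllib.request.urlopen(req) as resp:
        return resp.status
```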

Pipeline behavior

  1. Trigger — Your pipeline runs (e.g. on PR or push).
  2. Initialise — The script creates the Evaluator with credentials from the environment.
  3. Submit — The script calls evaluate_pipeline with project name, version, and eval data.
  4. Monitor — The script polls get_pipeline_results until the run is completed or times out.
  5. Report — The script formats the results (e.g. metrics comparison across versions) and, if you configured it, posts them (e.g. PR comment, job summary, or artifact).

Expected output: A comparison of eval metrics for the current version and any other versions you requested.
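For the report step, one simple format is a markdown table with one row per eval template and one column per version. The results shape below (a mapping of version → template → score) is a hypothetical illustration, not the SDK's documented schema; adapt the accessors to what get_pipeline_results actually returns.

```python
def format_comparison(results):
    """Render {version: {template: score}} as a markdown table."""
    versions = sorted(results)
    templates = sorted({t for scores in results.values() for t in scores})
    lines = [
        "| Eval | " + " | ".join(versions) + " |",
        "| --- | " + " | ".join("---" for _ in versions) + " |",
    ]
    for template in templates:
        cells = [str(results[v].get(template, "-")) for v in versions]
        lines.append(f"| {template} | " + " | ".join(cells) + " |")
    return "\n".join(lines)
```

The resulting string can be printed to the job log, written to a step summary, or passed to whatever posts your PR comment.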

[Diagram: Evaluation CI/CD pipeline]

