Evaluate via CI/CD Pipeline
Run Future AGI evaluations in your CI/CD pipeline to assess model performance on every pull request and keep quality checks consistent before deployment.
What it is
Evaluate via CI/CD is an automated evaluation workflow that integrates Future AGI evaluations into your CI/CD pipeline. It enables consistent quality gating on every pull request, version-based metric comparison across runs, and automated result reporting — driven by two SDK functions: evaluate_pipeline for submitting evals and get_pipeline_results for retrieving and comparing them.
Use cases
- Gate PRs on quality — Run evals on every PR so regressions in tone, factual consistency, or custom metrics block or flag merges.
- Compare versions in CI — Submit evaluations with a version tag and compare results across versions (e.g. current branch vs main) in one place.
- Automate quality reporting — Post eval results (e.g. metrics comparison table) as a PR comment so reviewers see model performance without leaving GitHub.
- Repeatable checks — Use the same eval templates and inputs in CI so every run is comparable.
How to
The pipeline uses two main SDK functions. Set up the evaluator and your eval data, then submit and retrieve results.
Initialise the evaluator
Create an evaluator instance with your API credentials. Prefer environment variables for secrets.
```python
import os

from fi.evals import Evaluator

evaluator = Evaluator(
    fi_api_key=os.getenv("FI_API_KEY"),
    fi_secret_key=os.getenv("FI_SECRET_KEY"),
)
```

Define your evaluation data
Define a list of eval configs, each with an eval_template, a model_name, and inputs (a mapping from input key to a list of values). For more on templates and inputs, see Running your first eval.
```python
eval_data = [
    {
        "eval_template": "tone",
        "model_name": "turing_large",
        "inputs": {
            "input": [
                "This product is amazing!",
                "I am very disappointed with the service."
            ]
        }
    },
    {
        "eval_template": "groundedness",
        "model_name": "turing_large",
        "inputs": {
            "input": [
                "What is the capital of France?",
                "Who wrote Hamlet?"
            ],
            "context": [
                "Paris is the capital and largest city of France.",
                "Hamlet is a tragedy written by William Shakespeare."
            ],
            "output": [
                "The capital of France is Paris.",
                "William Shakespeare wrote Hamlet."
            ]
        }
    }
]
```

Submit the evaluation pipeline
Submit your eval data with a project name and version tag.
```python
result = evaluator.evaluate_pipeline(
    project_name="your-project",
    version="v0.1.5",
    eval_data=eval_data,
)
```

Parameters:
- project_name — Your project identifier.
- version — Version tag for this run (e.g. branch or commit).
- eval_data — List of evaluation configurations (template, model, inputs).
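In CI, the version tag is usually derived from the branch or commit rather than hard-coded. A minimal sketch, assuming GitHub Actions environment variable names (GITHUB_HEAD_REF, GITHUB_REF_NAME, GITHUB_SHA); adapt the names for your CI system:

```python
import os

def ci_version(env=None):
    """Build a version tag like "my-branch-abc1234" from CI environment
    variables (assumption: GitHub Actions names; adjust for your runner)."""
    env = os.environ if env is None else env
    branch = env.get("GITHUB_HEAD_REF") or env.get("GITHUB_REF_NAME")
    sha = env.get("GITHUB_SHA", "")[:7]  # short commit hash
    if branch and sha:
        return f"{branch}-{sha}"
    return branch or sha or "local"
```

You can then pass ci_version() as the version argument to evaluate_pipeline so every run is tagged consistently.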
Retrieve results
Get results for one or more versions to compare.
```python
result = evaluator.get_pipeline_results(
    project_name="your-project",
    versions=["v0.1.0", "v0.1.1", "v0.1.5"],
)
```

Parameters:
- project_name — Your project identifier.
- versions — List of version tags to fetch results for.
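The exact return shape of get_pipeline_results is not shown here. As an illustration, assuming you have reduced the response to a mapping of version to per-metric scores, a per-metric comparison between two versions could look like:

```python
def metric_deltas(scores_by_version, baseline, candidate):
    """Difference (candidate - baseline) for every metric present in both
    versions. `scores_by_version` is an assumed shape: version -> {metric: score}."""
    base = scores_by_version[baseline]
    cand = scores_by_version[candidate]
    # Round to keep the PR comment readable; only compare shared metrics.
    return {m: round(cand[m] - base[m], 4) for m in base if m in cand}
```

A positive delta means the candidate version improved on that metric relative to the baseline.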
How to add to your project
Add an evaluation step to your existing CI/CD pipeline so evals run on every build or pull request.
Prerequisites
- A Future AGI account with API key and secret key.
- A CI system that can run a Python script (e.g. GitHub Actions, GitLab CI, Jenkins, or any runner with Python and network access).
What you need in your project
- Credentials — Set FI_API_KEY and FI_SECRET_KEY as environment variables (or in your CI’s secret store). Do not commit them.
- Project and version — Decide how you identify the run: e.g. PROJECT_NAME (your Future AGI project) and VERSION (branch name, tag, or commit). Pass these into your script.
- Eval data — The list of eval configs (template, model, inputs) you want to run each time. Define this in your script or load it from config.
- Script — A small script that initialises the Evaluator, calls evaluate_pipeline, polls get_pipeline_results until complete (or until a timeout), then formats the results. Optionally post the results (e.g. as a PR comment or job summary) using your CI’s API. You need the ai-evaluation package; for posting results you may also need an HTTP client and the right tokens/permissions for your CI.
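The polling step in that script can be kept generic so the same helper works with any SDK call. A sketch (the completion check is an assumption; inspect your actual get_pipeline_results response to decide what "complete" means for your project):

```python
import time

def poll_until(fetch, is_complete, timeout_s=600, interval_s=10):
    """Call `fetch()` repeatedly until `is_complete(result)` is truthy,
    raising TimeoutError after `timeout_s` seconds."""
    deadline = time.monotonic() + timeout_s
    while True:
        result = fetch()
        if is_complete(result):
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("evaluation did not complete before timeout")
        time.sleep(interval_s)
```

In the script you would wrap the SDK call, e.g. fetch could be a lambda that calls evaluator.get_pipeline_results for the current version, and is_complete a check on whatever status field the response carries.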
Add the step to your pipeline
- In your pipeline config, add a job or step that runs on the events you want (e.g. every PR, or on merge to main).
- In that job: set the environment variables above, install dependencies (including ai-evaluation), then run your evaluation script.
- Use the script’s exit code or output to fail the build when evals fail, or to attach the comparison table to the run (e.g. PR comment, artifact, or log).
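Failing the build can be as simple as returning a nonzero exit code when any metric falls below a minimum. A sketch with hypothetical metric names and threshold values; tune both to your own eval templates:

```python
# Hypothetical per-metric minimum scores; tune these to your project.
THRESHOLDS = {"tone": 0.7, "groundedness": 0.8}

def failing_metrics(scores, thresholds=THRESHOLDS):
    """Return the metrics whose score is below its threshold
    (a missing metric counts as a failure)."""
    return sorted(m for m, t in thresholds.items() if scores.get(m, 0.0) < t)

def gate(scores):
    """Exit code for CI: 0 when all thresholds pass, 1 otherwise."""
    failures = failing_metrics(scores)
    if failures:
        print("Eval gate failed:", ", ".join(failures))
        return 1
    print("Eval gate passed")
    return 0
```

Ending the script with sys.exit(gate(scores)) lets the CI job fail automatically when the gate does.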
Tip
If you use GitHub Actions and want to post results as a PR comment, grant the pull-requests: write permission to the job so it can create comments.
Pipeline behavior
- Trigger — Your pipeline runs (e.g. on PR or push).
- Initialise — The script creates the Evaluator with credentials from the environment.
- Submit — The script calls evaluate_pipeline with project name, version, and eval data.
- Monitor — The script polls get_pipeline_results until the run is completed or times out.
- Report — The script formats the results (e.g. metrics comparison across versions) and, if you configured it, posts them (e.g. PR comment, job summary, or artifact).
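For the report step, the cross-version comparison can be rendered as a Markdown table suitable for a PR comment or job summary. A sketch, again assuming results have been reduced to a version-to-scores mapping (the real response shape may differ):

```python
def markdown_table(scores_by_version, metrics):
    """Render a version-by-metric Markdown table; missing scores show as "-".
    `scores_by_version` is an assumed shape: version -> {metric: score}."""
    lines = [
        "| version | " + " | ".join(metrics) + " |",
        "|" + "---|" * (len(metrics) + 1),
    ]
    for version, scores in scores_by_version.items():
        cells = [f"{scores[m]:.2f}" if m in scores else "-" for m in metrics]
        lines.append("| " + version + " | " + " | ".join(cells) + " |")
    return "\n".join(lines)
```

The resulting string can be posted as-is wherever your CI renders Markdown.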
Expected output: A comparison of eval metrics for the current version and any other versions you requested.

What you can do next
Evaluate via Platform & SDK
Run a single eval from the UI or SDK.
Create custom evals
Define eval templates to use in your pipeline.
Eval groups
Run multiple evals together as a group.
Use custom models
Bring your own model for evaluations.
Future AGI models
Built-in models available for evals.
Evaluation overview
How evaluation fits into the platform.