# CI/CD Eval Pipeline: Automate Quality Gates in GitHub Actions
Set up FutureAGI's CI/CD Eval Pipeline to run automated quality gates on every pull request, failing builds when eval scores drop below your configured thresholds.
| Time | Difficulty | Package |
|---|---|---|
| 15 min | Intermediate | ai-evaluation |
By the end of this guide you will have a GitHub Actions workflow that runs faithfulness and toxicity evals on every PR, posts a pass/fail summary as a PR comment, and blocks merges when scores fall below threshold.
## Prerequisites

- FutureAGI account → app.futureagi.com
- API keys: `FI_API_KEY` and `FI_SECRET_KEY` (see Get your API keys)
- A GitHub repository with Actions enabled
- Python 3.9+
## Why eval in CI/CD?
Prompts change. Models drift. When you update a system prompt or swap a model, you want to know immediately if response quality dropped before it reaches users. A CI/CD eval pipeline catches regressions at review time with the same rigor you apply to code tests.
## Create the eval script

Create `scripts/evaluate_pipeline.py` in your repository. This script runs evals on a fixed test dataset and exits with a non-zero code if any metric falls below threshold, which causes the GitHub Actions step to fail.
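Stripped to its essentials, the quality gate is just a threshold comparison mapped to a process exit code. A minimal sketch (thresholds match the full script; the hardcoded scores are stand-ins for real eval results):

```python
# Per-metric floors; a run passes only if every metric clears its floor
THRESHOLDS = {"faithfulness": 0.85, "toxicity": 0.90}

def gate(scores: dict[str, float]) -> bool:
    """True only if every metric clears its configured floor."""
    return all(scores[name] >= floor for name, floor in THRESHOLDS.items())

def exit_code(scores: dict[str, float]) -> int:
    """0 lets the CI step pass; any non-zero value fails it."""
    return 0 if gate(scores) else 1

# In the real script this becomes: sys.exit(exit_code(scores))
print(exit_code({"faithfulness": 0.91, "toxicity": 0.95}))  # 0 -> step passes
print(exit_code({"faithfulness": 0.70, "toxicity": 0.95}))  # 1 -> step fails
```

The full script below does exactly this, with scores coming from FutureAGI evals instead of hardcoded values.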
```python
#!/usr/bin/env python3
"""
Evaluation pipeline for CI/CD.
Exit code 0 = all evals passed. Exit code 1 = one or more evals failed.
"""
import os
import sys

from openai import OpenAI
from fi.evals import evaluate

# Read keys up front so a missing key fails fast with a clear KeyError
FI_API_KEY = os.environ["FI_API_KEY"]
FI_SECRET_KEY = os.environ["FI_SECRET_KEY"]

client = OpenAI()

# Thresholds - adjust to match your quality bar
FAITHFULNESS_THRESHOLD = 0.85
TOXICITY_THRESHOLD = 0.90  # toxicity score: higher = safer

# Your system prompt - replace with your actual production prompt
SYSTEM_PROMPT = """You are a customer support agent for an electronics retailer.
Answer questions accurately using only the information provided in the context.
Be concise and helpful. If you are unsure, say so rather than guessing."""

# Test dataset - question + expected grounding context
TEST_CASES = [
    {
        "question": "What is the return window for electronics?",
        "context": "Electronics may be returned within 30 days of purchase with original packaging.",
    },
    {
        "question": "How long does standard shipping take?",
        "context": "Standard shipping takes 5-7 business days within the continental US.",
    },
    {
        "question": "Can I return a product bought on sale?",
        "context": "Sale items are eligible for exchange only. Full refunds are not available on sale purchases.",
    },
    {
        "question": "What payment methods do you accept?",
        "context": "We accept Visa, Mastercard, American Express, PayPal, and Apple Pay.",
    },
    {
        "question": "Do you offer international shipping?",
        "context": "International shipping is available to 45 countries. Delivery takes 10-21 business days.",
    },
]


def run_evals() -> bool:
    all_passed = True
    results = []

    print(f"\n{'Question':<45} {'Faithfulness':>14} {'Toxicity':>10} {'Status':>8}")
    print("-" * 81)

    for case in TEST_CASES:
        # Generate response from the agent
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": case["question"]},
            ],
        )
        output = response.choices[0].message.content

        # Run evals
        faithfulness = evaluate(
            "faithfulness",
            output=output,
            context=case["context"],
        )
        toxicity = evaluate(
            "toxicity",
            output=output,
            model="turing_small",
        )

        faith_pass = faithfulness.score >= FAITHFULNESS_THRESHOLD
        toxic_pass = toxicity.score >= TOXICITY_THRESHOLD
        row_passed = faith_pass and toxic_pass
        if not row_passed:
            all_passed = False

        status = "PASS" if row_passed else "FAIL"
        print(
            f"{case['question'][:43]:<45} "
            f"{faithfulness.score:>14.2f} "
            f"{toxicity.score:>10.2f} "
            f"{status:>8}"
        )
        results.append({
            "question": case["question"],
            "faithfulness": faithfulness.score,
            "toxicity": toxicity.score,
            "passed": row_passed,
        })

    passed_count = sum(1 for r in results if r["passed"])
    print(f"\nResult: {passed_count}/{len(results)} test cases passed.")
    print(f"Faithfulness threshold: {FAITHFULNESS_THRESHOLD}")
    print(f"Toxicity threshold: {TOXICITY_THRESHOLD}")
    return all_passed


if __name__ == "__main__":
    passed = run_evals()
    sys.exit(0 if passed else 1)
```

## Create the GitHub Actions workflow
Create `.github/workflows/eval.yml`:

```yaml
name: Eval Pipeline

on:
  pull_request:
    branches: [main, dev]
    paths:
      - "prompts/**"  # run evals when prompts change
      - "scripts/**"  # run evals when eval scripts change

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: pip install ai-evaluation openai

      - name: Run eval pipeline
        env:
          FI_API_KEY: ${{ secrets.FI_API_KEY }}
          FI_SECRET_KEY: ${{ secrets.FI_SECRET_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python scripts/evaluate_pipeline.py

      - name: Post results as PR comment
        if: always()  # post even if the eval step failed
        uses: actions/github-script@v7
        with:
          script: |
            const outcome = '${{ job.status }}';
            const status = outcome === 'success' ? '✅ All evals passed' : '❌ Evals failed - merge blocked';
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `## Eval Pipeline Results\n\n${status}\n\nSee the [Actions run](${context.serverUrl}/${context.repo.owner}/${context.repo.repo}/actions/runs/${context.runId}) for full output.`,
            });
```

## Add secrets to GitHub
Go to your GitHub repository → Settings → Secrets and variables → Actions → New repository secret.
Add three secrets:

- `FI_API_KEY` - your FutureAGI API key
- `FI_SECRET_KEY` - your FutureAGI secret key
- `OPENAI_API_KEY` - your OpenAI API key
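Before relying on CI, it's worth confirming locally that all three keys are exported. A small check along these lines (key names taken from the workflow above):

```python
import os

# The three keys the eval pipeline expects in the environment
REQUIRED_KEYS = ["FI_API_KEY", "FI_SECRET_KEY", "OPENAI_API_KEY"]

def missing_keys(env=os.environ) -> list[str]:
    """Return the names of required keys that are unset or empty."""
    return [name for name in REQUIRED_KEYS if not env.get(name)]

if __name__ == "__main__":
    gone = missing_keys()
    print("All eval pipeline keys are set." if not gone
          else f"Missing: {', '.join(gone)}")
```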
## Trigger the pipeline

Open a pull request that modifies a file in `prompts/`. The workflow triggers automatically.

When a PR introduces a prompt change that hurts quality:

- The `Run eval pipeline` step exits with code 1
- GitHub marks the check as failed
- The PR cannot be merged (if branch protection rules are enabled)
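Even on failure, the comment step still runs (the `if: always()` guard) and reports the blocked status. The body it posts is plain string assembly, sketched here in Python for readability (the workflow itself does this in JavaScript via `actions/github-script`):

```python
def build_comment(job_status: str, run_url: str) -> str:
    """Mirror the body assembled in the workflow's github-script step."""
    status = (
        "✅ All evals passed"
        if job_status == "success"
        else "❌ Evals failed - merge blocked"
    )
    return (
        "## Eval Pipeline Results\n\n"
        f"{status}\n\n"
        f"See the [Actions run]({run_url}) for full output."
    )

# Example URL is illustrative only
print(build_comment("failure", "https://github.com/acme/shop/actions/runs/42"))
```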
## Enable branch protection (recommended)

Go to your GitHub repository → Settings → Branches → Add rule.

- Branch name pattern: `main`
- Check: Require status checks to pass before merging
- Add `Eval Pipeline / evaluate` to the required checks list

Now the eval must pass before any PR can merge to `main`.
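Branch protection can also be configured through GitHub's REST API (`PUT /repos/{owner}/{repo}/branches/main/protection`). A sketch of the request payload, with the status-check context matching the workflow and job names above:

```json
{
  "required_status_checks": {
    "strict": true,
    "contexts": ["Eval Pipeline / evaluate"]
  },
  "enforce_admins": false,
  "required_pull_request_reviews": null,
  "restrictions": null
}
```

Setting `strict` to `true` additionally requires PR branches to be up to date with `main` before merging.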
## What you built
You now have a CI/CD pipeline that automatically evaluates LLM outputs on every pull request and blocks merges when quality drops.
- Created `evaluate_pipeline.py`, which runs faithfulness and toxicity evals on 5 test cases
- Built a GitHub Actions workflow that triggers on prompt changes, runs evals, and posts a PR comment
- Added FutureAGI and OpenAI secrets to GitHub
- Enabled branch protection so failing evals block merges