Overview

Future AGI’s evaluation platform can be seamlessly integrated with Langfuse, allowing you to attach evaluation results from our AI Evaluations library directly to your Langfuse traces. This enables you to monitor the performance and quality of your LLM applications within the Langfuse UI, correlating evaluation metrics with specific spans and traces.

How it works

When you call evaluator.evaluate() with the platform="langfuse" parameter inside an active Langfuse span, the evaluation is executed. The results are then automatically attached as scores to that specific span in your Langfuse dashboard.
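
Reduced to its essentials, the pattern looks like this (a minimal sketch; the full, runnable example is in the Usage section below):

with langfuse.start_as_current_span(name="my-operation") as span:
    # ... your LLM call or other application logic ...
    evaluator.evaluate(
        eval_templates="tone",                # any evaluation template
        inputs={"input": "text to evaluate"},
        custom_eval_name="my_eval",
        platform="langfuse"                   # attach the result to the active span
    )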

Usage

1. Installation

Before you begin, install the necessary Python packages:

pip install ai-evaluation fi-instrumentation-otel

2. Setup and Initialization

First, you need to set up your environment by initializing both the Langfuse and Future AGI clients.

import os
from langfuse import Langfuse
from fi.evals import Evaluator


# 1. Initialize Langfuse
langfuse = Langfuse(
    secret_key=os.getenv("LANGFUSE_SECRET_KEY"),
    public_key=os.getenv("LANGFUSE_PUBLIC_KEY"),
    host=os.getenv("LANGFUSE_HOST")
)

# 2. Initialize the Future AGI Evaluator
evaluator = Evaluator(
    fi_api_key=os.getenv("FI_API_KEY"),
    fi_secret_key=os.getenv("FI_SECRET_KEY"),
)

Make sure LANGFUSE_SECRET_KEY, LANGFUSE_PUBLIC_KEY, and LANGFUSE_HOST are set in your .env file, or pass them as arguments when initializing the Evaluator class:

evaluator = Evaluator(
    fi_api_key=os.getenv("FI_API_KEY"),
    fi_secret_key=os.getenv("FI_SECRET_KEY"),
    langfuse_secret_key=os.getenv("LANGFUSE_SECRET_KEY"),
    langfuse_public_key=os.getenv("LANGFUSE_PUBLIC_KEY"),
    langfuse_host=os.getenv("LANGFUSE_HOST")
)
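
If you use a .env file, it would typically contain entries like the following (the values shown are placeholders; sk-lf-/pk-lf- are the usual Langfuse key prefixes and https://cloud.langfuse.com is the Langfuse Cloud host):

# .env
FI_API_KEY=your-futureagi-api-key
FI_SECRET_KEY=your-futureagi-secret-key
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_HOST=https://cloud.langfuse.com

Note that os.getenv only reads variables already present in the environment, so loading a .env file usually requires a helper such as python-dotenv's load_dotenv() before the clients are initialized.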

3. Configure and Run Evaluations within a Langfuse Span

To link an evaluation to a specific operation in your code, run it within the context of a Langfuse span; the result is then automatically attached to that span.

In this example, we make an OpenAI chat completion call inside a span and run a tone evaluation on the model's response.

from openai import OpenAI

# Initialize the OpenAI client and prepare the user query
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
user_query = "Write a short, friendly product announcement."

# Start a Langfuse span
with langfuse.start_as_current_span(
    name="OpenAI call",
    input={"user_query": user_query},
) as span:

    # Your application logic, e.g. an LLM call
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user", "content": user_query}
        ]
    )

    result = response.choices[0].message.content
    span.update(output={"response": result})

    # Evaluate the tone of the OpenAI response
    evaluator.evaluate(
        eval_templates="tone",
        inputs={
            "input": result
        },
        custom_eval_name="evaluate_tone",
        model_name="turing_large",
        platform="langfuse"
    )

The results will appear as scores for the span in your Langfuse project.
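
In short-lived scripts, the Langfuse client sends data asynchronously in the background, so flush the client before the process exits to make sure the span and its scores actually reach Langfuse:

# Ensure all buffered spans and scores are delivered before exiting
langfuse.flush()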

To learn how to run other evaluations, refer to the evaluations documentation.
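
Other templates follow the same pattern. For instance, a reference-based check such as levenshtein_similarity can be run inside a span as shown below; this is a sketch that assumes the template takes a response and an expected_text input, so check the evaluations documentation for the exact input field names:

with langfuse.start_as_current_span(name="similarity check") as span:
    # Compare the model output against a known reference answer
    evaluator.evaluate(
        eval_templates="levenshtein_similarity",
        inputs={
            "response": "this is a sample response.",       # assumed input key
            "expected_text": "this is a sample response.",  # assumed input key
        },
        custom_eval_name="evaluate_similarity",
        platform="langfuse"
    )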

When calling evaluator.evaluate():

  • platform="langfuse": This essential parameter directs the evaluation results to be sent to Langfuse and linked with the current active span.
  • custom_eval_name: This parameter is required and provides a unique, human-readable name for your evaluation instance. This name will appear on the score in the Langfuse UI, helping you distinguish between different evaluations.
  • eval_templates and inputs: These specify which evaluation template to run and the inputs it should be evaluated against, as shown in the example above.