Step-by-Step Guide to Creating an Eval Task

1. Set Filters Based on Span Kind

Begin by defining filters that narrow down the spans you want to evaluate. Filters can be based on span properties such as:

  • Node Type
  • Created At

These filters determine which spans the evaluation task will process.
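
As a rough illustration of what such a filter does, the sketch below selects spans by node type and creation time. The span records, the field names (node_type, created_at), and the filter_spans helper are hypothetical and are not part of the product's API.

```python
from datetime import datetime, timezone

# Hypothetical span records; the field names are assumptions for illustration.
spans = [
    {"id": "a", "node_type": "llm", "created_at": datetime(2024, 5, 1, tzinfo=timezone.utc)},
    {"id": "b", "node_type": "retriever", "created_at": datetime(2024, 5, 2, tzinfo=timezone.utc)},
    {"id": "c", "node_type": "llm", "created_at": datetime(2024, 6, 1, tzinfo=timezone.utc)},
]


def filter_spans(spans, node_type, start, end):
    """Keep spans of the given node type created within [start, end)."""
    return [
        s for s in spans
        if s["node_type"] == node_type and start <= s["created_at"] < end
    ]


selected = filter_spans(
    spans,
    node_type="llm",
    start=datetime(2024, 5, 1, tzinfo=timezone.utc),
    end=datetime(2024, 6, 1, tzinfo=timezone.utc),
)
print([s["id"] for s in selected])  # span "c" falls on the exclusive end bound
```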

2. Choose Data Type

Decide whether you want to run the Evals on:

  • Historic Data: Apply Evals to a specified time range of already-collected data.
  • Continuous Data: Run the evaluation automatically as new data arrives. Recommended for continuously monitoring data in a production environment.
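
The difference between the two options can be sketched as a scope check on each span's timestamp. This is only an illustration: the RunMode enum, the in_scope helper, and the assumption that a continuous task covers spans created after the task was enabled are all hypothetical.

```python
from datetime import datetime, timezone
from enum import Enum


class RunMode(Enum):
    HISTORIC = "historic"      # evaluate a fixed window of stored spans
    CONTINUOUS = "continuous"  # evaluate spans as they arrive


def in_scope(created_at, mode, *, start=None, end=None, task_enabled_at=None):
    """Return True if a span with this timestamp falls within the task's scope."""
    if mode is RunMode.HISTORIC:
        return start <= created_at < end
    # Continuous: assume only spans created after the task was enabled count.
    return created_at >= task_enabled_at


utc = timezone.utc
old_span = datetime(2024, 5, 10, tzinfo=utc)
new_span = datetime(2024, 7, 2, tzinfo=utc)

historic_hit = in_scope(
    old_span,
    RunMode.HISTORIC,
    start=datetime(2024, 5, 1, tzinfo=utc),
    end=datetime(2024, 6, 1, tzinfo=utc),
)
continuous_hit = in_scope(
    new_span,
    RunMode.CONTINUOUS,
    task_enabled_at=datetime(2024, 7, 1, tzinfo=utc),
)
print(historic_hit, continuous_hit)
```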

3. Define Sampling Rate

Set a sampling rate to determine the percentage of the filtered data to process. A sampling rate of 100% means every matching data item is evaluated, whereas 50% means only half of the available data is evaluated. Lowering the rate helps control costs and manage data volume.
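
To make the effect of the sampling rate concrete, here is a minimal sketch that draws a fixed fraction of items at random. The sample_spans helper and its seed parameter are hypothetical; the platform may sample differently (for example, by deciding independently for each span).

```python
import random


def sample_spans(spans, rate_percent, seed=0):
    """Return roughly rate_percent of the spans, preserving their original order."""
    if not 0 <= rate_percent <= 100:
        raise ValueError("rate_percent must be between 0 and 100")
    k = round(len(spans) * rate_percent / 100)
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    chosen = sorted(rng.sample(range(len(spans)), k))
    return [spans[i] for i in chosen]


span_ids = [f"span-{i}" for i in range(10)]
half = sample_spans(span_ids, 50)
everything = sample_spans(span_ids, 100)
print(len(half), len(everything))
```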

4. Set Maximum Number of Spans

Define the maximum number of spans for each evaluation run. This cap keeps every run bounded in size and cost, so that a broad filter cannot cause an excessive amount of data to be processed at once.
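
The interaction between the sampling rate and the span cap can be estimated with simple arithmetic. The sketch below assumes the cap is applied after sampling; that ordering, and the spans_per_run helper, are assumptions made for illustration.

```python
def spans_per_run(matching_spans, sampling_rate_percent, max_spans):
    """Estimate how many spans one run processes: sample first, then cap."""
    sampled = round(matching_spans * sampling_rate_percent / 100)
    return min(sampled, max_spans)


# 10,000 matching spans at 50% gives 5,000 sampled, capped to 1,000.
print(spans_per_run(10_000, 50, 1_000))
# 1,200 matching spans at 50% gives 600 sampled, which is under the cap.
print(spans_per_run(1_200, 50, 1_000))
```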

5. Select Evals to Run

Choose the evaluations (Evals) to apply to your filtered data from the list of preset or previously configured Evals. Only the Evals you select here will be executed by the task.

Each evaluation requires specific keys. Take the Bias Detection evaluation as an example.

Bias Detection requires an input key. Every span stores its data as key-value pairs, known as span attributes, and you need to supply one of these span attributes as the input. For instance, if you pass llm.output_messages.0.message.content as the input, the Bias Detection evaluation will determine whether that content is biased. The evaluation returns Passed if the content is neutral, or Failed if any bias is detected.
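
The mapping from an evaluation key to a span attribute can be pictured as a dictionary lookup. In this sketch, only the attribute name llm.output_messages.0.message.content comes from the text above; the other attributes, the key_mapping structure, and the resolve_eval_inputs helper are hypothetical, and the evaluation itself is not implemented here.

```python
# Example span attributes stored as flat key-value pairs (values are made up).
span_attributes = {
    "llm.model_name": "example-model",
    "llm.output_messages.0.message.role": "assistant",
    "llm.output_messages.0.message.content": "Here is a summary of the report.",
}

# Hypothetical mapping: evaluation key -> span attribute that supplies its value.
key_mapping = {"input": "llm.output_messages.0.message.content"}


def resolve_eval_inputs(span_attributes, key_mapping):
    """Look up the span attribute configured for each evaluation key."""
    missing = [attr for attr in key_mapping.values() if attr not in span_attributes]
    if missing:
        raise KeyError(f"span is missing attributes: {missing}")
    return {key: span_attributes[attr] for key, attr in key_mapping.items()}


eval_inputs = resolve_eval_inputs(span_attributes, key_mapping)
print(eval_inputs)
```

A span that lacks the configured attribute raises a KeyError in this sketch, which is one way to surface a misconfigured key before any evaluation runs.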

For more information on the evaluations we support, please refer to the evals documentation.

6. Run the Task

Before saving the task, you can test the configuration to verify that the filters and Evals behave as expected. Once everything is configured correctly, run the task.
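
If you keep task settings in your own scripts, a quick sanity check before running can catch obvious mistakes. The EvalTaskConfig class, its field names, and the validate helper below are hypothetical; they mirror the steps in this guide rather than any real API.

```python
from dataclasses import dataclass, field


@dataclass
class EvalTaskConfig:
    """Hypothetical container for the settings described in steps 1 to 5."""
    filters: dict
    run_mode: str          # "historic" or "continuous"
    sampling_rate: float   # percentage, greater than 0 and at most 100
    max_spans: int
    evals: list = field(default_factory=list)


def validate(config):
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    if config.run_mode not in ("historic", "continuous"):
        problems.append("run_mode must be 'historic' or 'continuous'")
    if not 0 < config.sampling_rate <= 100:
        problems.append("sampling_rate must be greater than 0 and at most 100")
    if config.max_spans <= 0:
        problems.append("max_spans must be a positive integer")
    if not config.evals:
        problems.append("select at least one Eval")
    return problems


good = EvalTaskConfig(
    filters={"node_type": "llm"},
    run_mode="continuous",
    sampling_rate=50,
    max_spans=1000,
    evals=["Bias Detection"],
)
bad = EvalTaskConfig(filters={}, run_mode="continuous", sampling_rate=150, max_spans=1000)

print(validate(good))
print(validate(bad))
```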