Overview

This evaluation checks whether the model’s output semantically includes any (or all) of the key phrases provided. The metric is especially useful when exact wording may differ but meaning is preserved, or when the reference is a set of expected keywords.

How the Semantic List Contains Eval Works

  1. Encodes both the response and each reference phrase into dense vectors using a SentenceTransformer.
  2. Computes cosine similarity between the response and each phrase.
  3. Compares each similarity with a configurable threshold (e.g., 0.7).
  4. Returns 1.0 (match) or 0.0 (no match) depending on whether:
    • Any phrase matches (match_all = False, default)
    • All phrases match (match_all = True)
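The steps above can be sketched in plain Python. This is a minimal illustration, not the SDK’s implementation: the hypothetical `embed` helper substitutes toy bag-of-words vectors for real SentenceTransformer embeddings, but the cosine similarity, thresholding, and match_all logic mirror the description.

```python
import math
from collections import Counter

def embed(text):
    # Toy stand-in for a SentenceTransformer: bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a, b):
    # Standard cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_list_contains(response, phrases, threshold=0.7, match_all=False):
    resp_vec = embed(response)
    sims = {p: cosine(resp_vec, embed(p)) for p in phrases}
    matches = [s >= threshold for s in sims.values()]
    # any() when match_all is False, all() when it is True
    score = 1.0 if (all(matches) if match_all else any(matches)) else 0.0
    return score, sims

score, sims = semantic_list_contains(
    "The quick brown fox jumps over the lazy dog.",
    ["brown fox", "dancing giraffe"],
    threshold=0.3,  # toy vectors produce lower similarities than real embeddings
)
```

With real sentence embeddings, semantically related phrases score high even without token overlap; the toy vectors here only match on shared words, which is why the threshold is lowered for the demo.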

Semantic List Contains Eval using Future AGI’s Python SDK

Click here to learn how to set up an evaluation using the Python SDK.

Input & Configuration:

Required Inputs:
  • response (str): Model-generated output to be evaluated
  • expected_text (str or List[str]): A single phrase or list of phrases that the response is expected to semantically include

Optional Config:
  • case_insensitive (bool): Whether to lowercase input texts before comparison. Default: True
  • remove_punctuation (bool): Whether to strip punctuation from texts. Default: True
  • match_all (bool): If True, all phrases must be semantically present; if False, any one match is enough. Default: False
  • similarity_threshold (float): Similarity threshold for considering a match. Typical range: 0.5–0.9. Default: 0.7
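Putting the optional parameters together, a config dict with every documented option spelled out (at its default value) would look like this:

```python
# All optional parameters at their documented defaults.
config = {
    "case_insensitive": True,      # lowercase texts before comparison
    "remove_punctuation": True,    # strip punctuation before comparison
    "match_all": False,            # any one matching phrase is enough
    "similarity_threshold": 0.7,   # minimum cosine similarity for a match
}
```

Omitted keys fall back to these defaults, so in practice you only need to pass the values you want to change.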

Output:

  • score (float): Returns 1.0 if the match criteria are satisfied, 0.0 otherwise
  • metadata (dict): Contains similarity values for each phrase, the threshold, and the match logic used

Example:

from fi.evals.metrics import SemanticListContains
from fi.testcases import TestCase
import json

# expected_text takes a JSON-encoded list of candidate phrases
test_case = TestCase(
    response="The quick brown fox jumps over the lazy dog.",
    expected_text=json.dumps(["brown fox", "lazy dog", "dancing giraffe"])
)

# Any phrase with cosine similarity >= 0.6 counts as a match
evaluator = SemanticListContains(config={
    "similarity_threshold": 0.6,
    "match_all": False
})

result = evaluator.evaluate([test_case])
score = result.eval_results[0].metrics[0].value
metadata = result.eval_results[0].metadata

print(f"Score: {score}")
print("Similarity Details:", metadata)

Output:

Score: 1.0
Similarity Details: {'matches': [True, False, False], 'similarities': {'brown fox': 0.6240062713623047, 'lazy dog': 0.5937517639250626, 'dancing giraffe': 0.28756572530065383}, 'threshold': 0.6, 'match_all': False}

What if the Semantic List Contains Eval Score is Low?

  • Lower the similarity_threshold value (if your use case allows relaxed semantic matches).
  • Set match_all = False if partial coverage is acceptable.
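To see how the threshold choice changes the outcome, you can replay the similarity values from the example output above against several thresholds without the SDK. The `score` helper below is a hypothetical stand-in for the eval’s final decision step:

```python
# Similarity values from the example metadata above (rounded).
similarities = {
    "brown fox": 0.624,
    "lazy dog": 0.594,
    "dancing giraffe": 0.288,
}

def score(sims, threshold, match_all=False):
    # Mirrors the eval's decision: any() vs. all() over per-phrase matches.
    matches = [s >= threshold for s in sims.values()]
    return 1.0 if (all(matches) if match_all else any(matches)) else 0.0

for t in (0.5, 0.6, 0.7):
    print(t, score(similarities, t), score(similarities, t, match_all=True))
```

At a threshold of 0.6 the "brown fox" phrase still clears the bar, so the any-match score stays 1.0; raise the threshold to 0.7 and no phrase matches, dropping the score to 0.0.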