Overview

This evaluation checks whether the model's output semantically matches any of the provided key phrases. The metric is especially useful when exact wording may differ but meaning is preserved, or when the reference is a set of expected keywords.

How Does the Semantic List Contains Eval Work?

  1. Encodes both the response and the reference phrases into dense vectors using a SentenceTransformer.
  2. Computes cosine similarity between the response and each phrase.
  3. Compares each similarity score against a configurable threshold (e.g., 0.7).
  4. Returns 1.0 (match) or 0.0 (no match) depending on whether:
    • any phrase matches (match_all = False, default), or
    • all phrases match (match_all = True).
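The steps above can be sketched in plain Python. This is an illustrative implementation, not the SDK's internal code: `encode` stands in for a SentenceTransformer's `model.encode`, replaced here with a toy character-frequency embedding so the sketch runs without downloading a model.

```python
import math

def encode(text):
    # Toy stand-in for SentenceTransformer.encode: a fixed-size
    # character-frequency vector. The real eval uses dense sentence embeddings.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def semantic_list_contains(response, expected_text,
                           similarity_threshold=0.7, match_all=False):
    # Accept a single phrase or a list of phrases.
    phrases = [expected_text] if isinstance(expected_text, str) else expected_text
    response_vec = encode(response)                                   # step 1
    similarities = {p: cosine_similarity(response_vec, encode(p))
                    for p in phrases}                                 # step 2
    matches = [s >= similarity_threshold for s in similarities.values()]  # step 3
    matched = all(matches) if match_all else any(matches)             # step 4
    return (1.0 if matched else 0.0), similarities
```

With `match_all=False`, a single phrase clearing the threshold yields a score of 1.0; with `match_all=True`, every phrase must clear it.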

Evaluation Using SDK

Click here to learn how to set up evaluation using the SDK.
Input & Configuration:
| Parameter | Type | Description |
| --- | --- | --- |
| **Required Inputs** | | |
| `response` | `str` | Model-generated output to be evaluated |
| `expected_text` | `str` or `List[str]` | A single phrase or list of phrases that the response is expected to semantically include |
| **Optional Config** | | |
| `case_insensitive` | `bool` | Whether to lowercase input texts before comparison. Default: `True` |
| `remove_punctuation` | `bool` | Whether to strip punctuation from texts. Default: `True` |
| `match_all` | `bool` | If `True`, all phrases must be semantically present; if `False`, any one match is enough. Default: `False` |
| `similarity_threshold` | `float` | Similarity threshold for considering a match. Typical range: 0.5 to 0.9. Default: `0.7` |
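When `case_insensitive` or `remove_punctuation` is enabled, both texts are normalized before encoding. A minimal sketch of that preprocessing (illustrative; the SDK's exact normalization may differ):

```python
import string

def preprocess(text, case_insensitive=True, remove_punctuation=True):
    # Normalize text before embedding, mirroring the optional config flags.
    if case_insensitive:
        text = text.lower()
    if remove_punctuation:
        text = text.translate(str.maketrans("", "", string.punctuation))
    return text
```

For example, `preprocess("Hello, World!")` yields `"hello world"`, so surface differences in casing and punctuation do not affect the similarity scores.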
Output:
| Output Field | Type | Description |
| --- | --- | --- |
| `score` | `float` | Float between 0.0 and 1.0; 1.0 if the match criteria are satisfied, 0.0 otherwise |
| `metadata` | `dict` | Contains similarity values for each phrase, the threshold, and the match logic used |
result = evaluator.evaluate(
    eval_templates="semantic_list_contains",
    inputs={
        "expected_text": "The Eiffel Tower is a famous landmark in Paris, built in 1889 for the World's Fair. It stands 324 meters tall.",
        "response": "The Eiffel Tower, located in Paris, was built in 1889 and is 324 meters high."
    },
    model_name="turing_flash"
)

print(result.eval_results[0].output)
print(result.eval_results[0].reason)
Output:
Score: 1.0
matches: [True, False, False]
similarities: {'brown fox': 0.6240062713623047, 'lazy dog': 0.5937517639250626, 'dancing giraffe': 0.28756572530065383}
threshold: 0.6
match_all: False
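The score follows directly from the metadata: with `match_all=False`, one similarity clearing the 0.6 threshold is enough. Reproducing that aggregation (note this sample output reflects a three-phrase reference list, not the single-string `expected_text` in the snippet above):

```python
# Similarity values taken from the sample metadata above.
similarities = {
    "brown fox": 0.6240062713623047,
    "lazy dog": 0.5937517639250626,
    "dancing giraffe": 0.28756572530065383,
}
threshold = 0.6
match_all = False

matches = [sim >= threshold for sim in similarities.values()]
# Only 'brown fox' (0.624) clears the 0.6 threshold.
matched = all(matches) if match_all else any(matches)
score = 1.0 if matched else 0.0
```

Here `matches` is `[True, False, False]` and `score` is `1.0`, matching the output shown.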

What if Semantic List Contains Eval Score is Low?

  • Lower the `similarity_threshold` value (if your use case allows relaxed semantic matches).
  • Set `match_all=False` (the default) if partial coverage is acceptable; any single matching phrase then yields a score of 1.0.