String & Similarity

23 local metrics for keyword matching, regex, length checks, BLEU, ROUGE, Levenshtein, and embedding similarity.

📝 TL;DR
  • 16 string check metrics (binary 0/1): contains, equals, regex, length, email/link detection
  • 7 similarity metrics (continuous 0.0-1.0): BLEU, ROUGE, recall, Levenshtein, numeric, embedding
  • embedding_similarity and semantic_list_contains need pip install ai-evaluation[embeddings]

String and similarity metrics run locally with no LLM calls. String checks return binary pass/fail (0 or 1). Similarity metrics return a continuous score between 0.0 and 1.0.

from fi.evals import evaluate

result = evaluate("contains", output="The meeting is at 3 PM tomorrow.", keyword="meeting")
print(result.score)    # 1.0
print(result.passed)   # True
print(result.reason)   # "Keyword 'meeting' found"
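
A failing check returns a score of 0 and passed set to False; the exact reason text may differ from what is shown here.

result = evaluate("contains", output="No relevant text here.", keyword="meeting")
print(result.score)    # 0.0
print(result.passed)   # False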

String Check Metrics

Binary scores: 1 (pass) or 0 (fail). No config required unless noted.

Metric                  What it checks
contains                Output contains a keyword
contains_all            Output contains every keyword in a list
contains_any            Output contains at least one keyword from a list
contains_none           Output contains none of the forbidden keywords
contains_email          Output contains an email address
contains_link           Output contains a URL
contains_valid_link     Output contains a reachable URL (makes HTTP request)
is_email                Entire output is a valid email address
one_line                Output has no newlines
equals                  Exact match with expected_output
starts_with             Output starts with a given string
ends_with               Output ends with a given string
regex                   Output matches a regular expression
length_less_than        Output is under a maximum character count
length_greater_than     Output exceeds a minimum character count
length_between          Output length is within a range

contains

Output contains a keyword. Pass the keyword as a kwarg.

result = evaluate("contains", output="Contact our support team.", keyword="support")
# score → 1.0, reason → "Keyword 'support' found"

contains_all

Output contains every keyword in the list. Fails if any keyword is missing.

result = evaluate(
    "contains_all",
    output="Order shipped, delivered Friday.",
    config={"keywords": ["shipped", "delivered", "Friday"]},
)
# score → 1.0, reason → "All 3 keywords found."

contains_any

Output contains at least one keyword from the list.

result = evaluate(
    "contains_any",
    output="Pay via credit card or PayPal.",
    config={"keywords": ["credit card", "PayPal", "bank transfer"]},
)
# score → 1.0, reason → "Found keywords: credit card, PayPal"

contains_none

Output contains none of the forbidden keywords.

result = evaluate(
    "contains_none",
    output="Have a great day!",
    config={"keywords": ["bad", "evil", "terrible"]},
)
# score → 1.0, reason → "No forbidden keywords found."

contains_email

Output contains at least one email address.

result = evaluate("contains_email", output="Reach us at support@example.com")
# score → 1.0

contains_link

Output contains at least one URL.

result = evaluate("contains_link", output="Visit https://docs.futureagi.com for details.")
# score → 1.0

contains_valid_link

Output contains a URL that responds to an HTTP request. Makes a live network call.

result = evaluate("contains_valid_link", output="More info at https://www.google.com")
# score → 1.0
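
An unreachable URL should fail the check; the result depends on live network conditions. The .invalid TLD is reserved and never resolves, which makes it a safe test case:

result = evaluate("contains_valid_link", output="See http://example.invalid/page")
# score → 0.0 (URL does not resolve)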

is_email

Entire output is a valid email address. Fails if the output contains anything else.

result = evaluate("is_email", output="user@example.com")
# score → 1.0

result = evaluate("is_email", output="Contact user@example.com")
# score → 0.0 (not just an email)

one_line

Output contains no newline characters.

result = evaluate("one_line", output="The capital of France is Paris.")
# score → 1.0
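
Any newline in the output fails the check:

result = evaluate("one_line", output="First line.\nSecond line.")
# score → 0.0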

equals

Exact string match against expected_output. Case-sensitive.

result = evaluate("equals", output="Paris", expected_output="Paris")
# score → 1.0

result = evaluate("equals", output="paris", expected_output="Paris")
# score → 0.0

starts_with

Output begins with the given string.

result = evaluate("starts_with", output="Summary: The report covers Q3.", keyword="Summary:")
# score → 1.0

ends_with

Output ends with the given string.

result = evaluate("ends_with", output="Thank you for your patience.", keyword="patience.")
# score → 1.0

regex

Output matches a regular expression. Pass the pattern via config.

result = evaluate("regex", output="Order #12345 confirmed.", config={"pattern": r"#\d+"})
# score → 1.0, reason → "Regex pattern '#\d+' found in response."
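
A pattern with no match anywhere in the output fails:

result = evaluate("regex", output="Order confirmed.", config={"pattern": r"#\d+"})
# score → 0.0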

length_less_than

Output is under a maximum character count.

result = evaluate("length_less_than", output="Yes.", config={"max_length": 100})
# score → 1.0, reason → "Length 4 < 100"

length_greater_than

Output exceeds a minimum character count.

result = evaluate("length_greater_than", output="Hello world", config={"min_length": 5})
# score → 1.0, reason → "Length 11 > 5"

length_between

Output length falls within an inclusive range.

result = evaluate("length_between", output="Hello", config={"min_length": 3, "max_length": 10})
# score → 1.0, reason → "Length 5 is between [3, 10]"
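
The bounds are inclusive, so an output whose length equals min_length or max_length still passes:

result = evaluate("length_between", output="abc", config={"min_length": 3, "max_length": 10})
# score → 1.0 (length 3 equals min_length)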

Similarity Metrics

Continuous scores between 0.0 and 1.0. All require expected_output except semantic_list_contains, which takes a keyword list via config.

Metric                   What it measures
bleu_score               N-gram precision between output and expected
rouge_score              N-gram overlap (recall-oriented)
recall_score             Word-level recall from expected into output
levenshtein_similarity   Normalized character edit distance
numeric_similarity       Numeric value proximity
embedding_similarity     Semantic similarity via embeddings*
semantic_list_contains   Semantic keyword match in output*

*Requires pip install ai-evaluation[embeddings]

bleu_score

BLEU score between output and expected. Measures n-gram precision. Commonly used for translation and summarization.

result = evaluate("bleu_score", output="The cat sat on the mat.", expected_output="The cat is sitting on the mat.")
# score → ~0.42
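
Outputs sharing no n-grams with the reference score at or near zero:

result = evaluate("bleu_score", output="Completely unrelated words here.", expected_output="The cat is sitting on the mat.")
# score → ~0.0 (no n-gram overlap)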

rouge_score

ROUGE score measuring n-gram overlap. Defaults to rouge1 (unigram overlap). Set rouge_type in config for rouge2 (bigram overlap) or rougeL (longest common subsequence).

result = evaluate("rouge_score", output="The cat sat on the mat.", expected_output="The cat is sitting on the mat.")
# score → ~0.77 (rouge1 default)

result = evaluate("rouge_score", output="The cat sat.", expected_output="The cat is sitting.", config={"rouge_type": "rougeL"})
# rouge_type options: "rouge1", "rouge2", "rougeL"

recall_score

Word-level recall: what fraction of words in expected_output appear in output.

result = evaluate("recall_score", output="Paris is the capital of France and a major city.", expected_output="Paris is the capital of France.")
# score → 1.0 (all expected words found)
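
Missing expected words lower the score proportionally; the exact value depends on how the metric tokenizes text. Assuming simple whitespace tokenization with punctuation stripped:

result = evaluate("recall_score", output="Paris is the capital.", expected_output="Paris is the capital of France.")
# score → ~0.67 (4 of 6 expected words found)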

levenshtein_similarity

Normalized edit distance between two strings. 1.0 = identical, 0.0 = completely different. Character-level.

result = evaluate("levenshtein_similarity", output="kitten", expected_output="sitting")
# score → ~0.57
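
The ~0.57 above is consistent with normalizing the edit distance by the longer string's length. A minimal sketch of that calculation, assuming this normalization (the library's exact formula may differ):

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    # Map distance to [0, 1]: 1.0 means identical strings.
    return 1 - levenshtein(a, b) / max(len(a), len(b), 1)

print(round(similarity("kitten", "sitting"), 2))  # 0.57 (edit distance 3, max length 7)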

numeric_similarity

Compares numeric values extracted from output and expected.

result = evaluate("numeric_similarity", output="102", expected_output="100")
# score → ~0.98
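
The ~0.98 above matches a relative-difference formula. A plausible sketch, assuming the score is one minus the difference relative to the larger magnitude (the library's exact formula isn't documented here):

def numeric_similarity(a: float, b: float) -> float:
    # Assumed formula: relative closeness of two numbers.
    if a == b:
        return 1.0
    return 1 - abs(a - b) / max(abs(a), abs(b))

print(round(numeric_similarity(102, 100), 2))  # 0.98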

embedding_similarity

Semantic similarity via text embeddings. Captures meaning, not just word overlap.

Note

Requires pip install ai-evaluation[embeddings]

result = evaluate("embedding_similarity", output="The dog chased the ball.", expected_output="A canine ran after a ball in the garden.")
# score → ~0.91 (semantically similar despite different words)

# Config: similarity_method → "cosine" (default), "euclidean", "manhattan"
result = evaluate("embedding_similarity", output="...", expected_output="...", config={"similarity_method": "euclidean"})

semantic_list_contains

Checks whether the output contains phrases semantically similar to keywords. Uses embeddings.

Note

Requires pip install ai-evaluation[embeddings]

result = evaluate(
    "semantic_list_contains",
    output="Greetings! How can I assist you?",
    config={"keywords": ["hello", "help"], "similarity_threshold": 0.7},
)
# score → 1.0 ("Greetings" is semantically close to "hello")
# similarity_threshold: float, default 0.7 — lower = more permissive
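
Raising similarity_threshold makes the check stricter; the exact similarity scores depend on the embedding model, so the outcome below is indicative rather than guaranteed:

result = evaluate(
    "semantic_list_contains",
    output="Greetings! How can I assist you?",
    config={"keywords": ["hello", "help"], "similarity_threshold": 0.95},
)
# score → likely 0.0 at this strict threshold (model-dependent)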