String & Similarity
23 local metrics for keyword matching, regex, length checks, BLEU, ROUGE, Levenshtein, and embedding similarity.
- 16 string check metrics (binary 0/1): contains, equals, regex, length, email/link detection
- 7 similarity metrics (continuous 0.0-1.0): BLEU, ROUGE, recall, Levenshtein, numeric, embedding
embedding_similarity and semantic_list_contains require pip install ai-evaluation[embeddings]
String and similarity metrics run locally with no LLM calls. String checks return binary pass/fail (0 or 1). Similarity metrics return a continuous score between 0.0 and 1.0.
from fi.evals import evaluate
result = evaluate("contains", output="The meeting is at 3 PM tomorrow.", keyword="meeting")
print(result.score) # 1.0
print(result.passed) # True
print(result.reason) # "Keyword 'meeting' found"
String Check Metrics
Binary scores: 1 (pass) or 0 (fail). Keywords, patterns, and length bounds are passed as kwargs or via config, as shown in each example below.
| Metric | What it checks |
|---|---|
| contains | Output contains a keyword |
| contains_all | Output contains every keyword in a list |
| contains_any | Output contains at least one keyword from a list |
| contains_none | Output contains none of the forbidden keywords |
| contains_email | Output contains an email address |
| contains_link | Output contains a URL |
| contains_valid_link | Output contains a reachable URL (makes an HTTP request) |
| is_email | Entire output is a valid email address |
| one_line | Output has no newlines |
| equals | Exact match with expected_output |
| starts_with | Output starts with a given string |
| ends_with | Output ends with a given string |
| regex | Output matches a regular expression |
| length_less_than | Output is under a maximum character count |
| length_greater_than | Output exceeds a minimum character count |
| length_between | Output length is within a range |
contains
Output contains a keyword. Pass the keyword as a kwarg.
result = evaluate("contains", output="Contact our support team.", keyword="support")
# score → 1.0, reason → "Keyword 'support' found"
contains_all
Output contains every keyword in the list. Fails if any keyword is missing.
result = evaluate(
"contains_all",
output="Order shipped, delivered Friday.",
config={"keywords": ["shipped", "delivered", "Friday"]},
)
# score → 1.0, reason → "All 3 keywords found."
contains_any
Output contains at least one keyword from the list.
result = evaluate(
"contains_any",
output="Pay via credit card or PayPal.",
config={"keywords": ["credit card", "PayPal", "bank transfer"]},
)
# score → 1.0, reason → "Found keywords: credit card, PayPal"
contains_none
Output contains none of the forbidden keywords.
result = evaluate(
"contains_none",
output="Have a great day!",
config={"keywords": ["bad", "evil", "terrible"]},
)
# score → 1.0, reason → "No forbidden keywords found."
contains_email
Output contains at least one email address.
result = evaluate("contains_email", output="Reach us at support@example.com")
# score → 1.0
contains_link
Output contains at least one URL.
result = evaluate("contains_link", output="Visit https://docs.futureagi.com for details.")
# score → 1.0
contains_valid_link
Output contains a URL that responds to an HTTP request. Makes a live network call.
result = evaluate("contains_valid_link", output="More info at https://www.google.com")
# score → 1.0
is_email
Entire output is a valid email address. Fails if the output contains anything else.
result = evaluate("is_email", output="user@example.com")
# score → 1.0
result = evaluate("is_email", output="Contact user@example.com")
# score → 0.0 (not just an email)
one_line
Output contains no newline characters.
result = evaluate("one_line", output="The capital of France is Paris.")
# score → 1.0
equals
Exact string match against expected_output. Case-sensitive.
result = evaluate("equals", output="Paris", expected_output="Paris")
# score → 1.0
result = evaluate("equals", output="paris", expected_output="Paris")
# score → 0.0
starts_with
Output begins with the given string.
result = evaluate("starts_with", output="Summary: The report covers Q3.", keyword="Summary:")
# score → 1.0
ends_with
Output ends with the given string.
result = evaluate("ends_with", output="Thank you for your patience.", keyword="patience.")
# score → 1.0
regex
Output matches a regular expression. Pass the pattern via config.
result = evaluate("regex", output="Order #12345 confirmed.", config={"pattern": r"#\d+"})
# score → 1.0, reason → "Regex pattern '#\d+' found in response."
length_less_than
Output is under a maximum character count.
result = evaluate("length_less_than", output="Yes.", config={"max_length": 100})
# score → 1.0, reason → "Length 4 < 100"
length_greater_than
Output exceeds a minimum character count.
result = evaluate("length_greater_than", output="Hello world", config={"min_length": 5})
# score → 1.0, reason → "Length 11 > 5"
length_between
Output length falls within an inclusive range.
result = evaluate("length_between", output="Hello", config={"min_length": 3, "max_length": 10})
# score → 1.0, reason → "Length 5 is between [3, 10]"
Similarity Metrics
Continuous scores between 0.0 and 1.0. All compare against expected_output except semantic_list_contains, which matches against a keyword list.
| Metric | What it measures |
|---|---|
| bleu_score | N-gram precision between output and expected |
| rouge_score | N-gram overlap (recall-oriented) |
| recall_score | Word-level recall from expected into output |
| levenshtein_similarity | Normalized character edit distance |
| numeric_similarity | Numeric value proximity |
| embedding_similarity | Semantic similarity via embeddings* |
| semantic_list_contains | Semantic keyword match in output* |
*Requires pip install ai-evaluation[embeddings]
bleu_score
BLEU score between output and expected. Measures n-gram precision. Commonly used for translation and summarization.
result = evaluate("bleu_score", output="The cat sat on the mat.", expected_output="The cat is sitting on the mat.")
# score → ~0.42
rouge_score
ROUGE score measuring n-gram overlap. Defaults to rouge1. Set rouge_type in config for other variants.
result = evaluate("rouge_score", output="The cat sat on the mat.", expected_output="The cat is sitting on the mat.")
# score → ~0.77 (rouge1 default)
result = evaluate("rouge_score", output="The cat sat.", expected_output="The cat is sitting.", config={"rouge_type": "rougeL"})
# rouge_type options: "rouge1", "rouge2", "rougeL"
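ROUGE-1 F1 is simple enough to compute by hand for intuition. The sketch below is not the library's implementation (the rouge1_f1 name and whitespace tokenization are assumptions), but it reproduces the ~0.77 example above:

```python
from collections import Counter

def rouge1_f1(output: str, expected: str) -> float:
    """ROUGE-1 F1: clipped unigram overlap between output and expected."""
    out = Counter(output.lower().split())
    exp = Counter(expected.lower().split())
    overlap = sum((out & exp).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(out.values())
    recall = overlap / sum(exp.values())
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f1("The cat sat on the mat.", "The cat is sitting on the mat."), 2))  # 0.77
```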
recall_score
Word-level recall: what fraction of words in expected_output appear in output.
result = evaluate("recall_score", output="Paris is the capital of France and a major city.", expected_output="Paris is the capital of France.")
# score → 1.0 (all expected words found)
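A minimal sketch of the underlying idea (the recall_score helper and regex tokenization here are illustrative, not the library's code):

```python
import re

def recall_score(output: str, expected: str) -> float:
    """Fraction of words in expected that also appear in output."""
    tokenize = lambda s: re.findall(r"[a-z0-9']+", s.lower())
    out_words = set(tokenize(output))
    exp_words = tokenize(expected)
    if not exp_words:
        return 0.0
    return sum(w in out_words for w in exp_words) / len(exp_words)

print(recall_score("Paris is the capital of France and a major city.",
                   "Paris is the capital of France."))  # 1.0
```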
levenshtein_similarity
Normalized edit distance between two strings. 1.0 = identical, 0.0 = completely different. Character-level.
result = evaluate("levenshtein_similarity", output="kitten", expected_output="sitting")
# score → ~0.57
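The score is 1 - edit_distance / max(len_a, len_b). A self-contained sketch using the classic dynamic-programming recurrence (illustrative, not the library's implementation):

```python
def levenshtein_similarity(a: str, b: str) -> float:
    """1 - edit_distance(a, b) / max(len(a), len(b))."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))  # row for the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return 1 - prev[-1] / max(len(a), len(b))

print(round(levenshtein_similarity("kitten", "sitting"), 2))  # 0.57
```

kitten → sitting takes 3 edits, and max length is 7, so the score is 1 - 3/7 ≈ 0.57.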
numeric_similarity
Compares numeric values extracted from output and expected.
result = evaluate("numeric_similarity", output="102", expected_output="100")
# score → ~0.98
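The exact normalization is not documented here; one plausible formula that reproduces the example, assuming the score is one minus the relative difference against the larger magnitude:

```python
import re

def numeric_similarity(output: str, expected: str) -> float:
    """Assumed scoring: 1 minus relative difference of the first numbers found."""
    a = float(re.search(r"-?\d+(?:\.\d+)?", output).group())
    b = float(re.search(r"-?\d+(?:\.\d+)?", expected).group())
    denom = max(abs(a), abs(b))
    return 1.0 if denom == 0 else max(0.0, 1 - abs(a - b) / denom)

print(round(numeric_similarity("102", "100"), 2))  # 0.98
```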
embedding_similarity
Semantic similarity via text embeddings. Captures meaning, not just word overlap.
Note: requires pip install ai-evaluation[embeddings]
result = evaluate("embedding_similarity", output="The dog chased the ball.", expected_output="A canine ran after a ball in the garden.")
# score → ~0.91 (semantically similar despite different words)
# Config: similarity_method → "cosine" (default), "euclidean", "manhattan"
result = evaluate("embedding_similarity", output="...", expected_output="...", config={"similarity_method": "euclidean"})
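With the default cosine method, the score reduces to a normalized dot product over the two embedding vectors. A toy sketch of that arithmetic (the 3-d vectors below stand in for real model embeddings):

```python
import math

def cosine_similarity(u, v):
    """Dot product divided by the product of the vector magnitudes."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy 3-d vectors standing in for real sentence embeddings.
print(round(cosine_similarity([0.2, 0.8, 0.1], [0.25, 0.75, 0.15]), 2))  # 0.99
```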
semantic_list_contains
Checks whether the output contains phrases semantically similar to keywords. Uses embeddings.
Note: requires pip install ai-evaluation[embeddings]
result = evaluate(
"semantic_list_contains",
output="Greetings! How can I assist you?",
config={"keywords": ["hello", "help"], "similarity_threshold": 0.7},
)
# score → 1.0 ("Greetings" is semantically close to "hello")
# similarity_threshold: float, default 0.7 — lower = more permissive