Metrics Reference
Browse all 76+ local evaluation metrics by category. String checks, JSON validation, similarity, hallucination, RAG, agents, structured output, and guardrails.
- 76+ local metrics, all run in under 1ms with no API key
- `from fi.evals import evaluate`, then pass any metric name as a string
- Metrics are grouped by category below — click through for full docs and examples
All local metrics run via the same evaluate() function. Pass the metric name as a string and provide the required inputs as keyword arguments.
from fi.evals import evaluate
result = evaluate("contains", output="Hello world", keyword="Hello")
print(result.score) # 1.0
print(result.passed) # True
print(result.reason) # "Keyword 'Hello' found"
Categories
String & Similarity
23 metrics — keyword matching, regex, length checks, BLEU, ROUGE, Levenshtein, and embedding similarity.
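As an illustration of what this category measures, here is a minimal pure-Python sketch of an edit-distance similarity in the spirit of `levenshtein_similarity`. The function names and the normalization (distance divided by the longer string's length) are assumptions for illustration, not the library's actual internals.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def levenshtein_similarity(a: str, b: str) -> float:
    """Normalize edit distance into a 0..1 similarity score."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))
```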
JSON & Structured Output
14 metrics — JSON validation, schema compliance, field completeness, type checking, and hierarchy scoring.
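A rough sketch of the kind of check a field-completeness metric performs: parse the output as JSON and score the fraction of expected fields that are present and non-empty. The function signature and the definition of "empty" here are illustrative assumptions.

```python
import json

def field_completeness(output: str, expected_fields: list[str]) -> float:
    """Fraction of expected fields present and non-empty in the output JSON."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(data, dict):
        return 0.0
    if not expected_fields:
        return 1.0
    filled = sum(1 for f in expected_fields
                 if f in data and data[f] not in (None, "", [], {}))
    return filled / len(expected_fields)
```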
Hallucination
5 metrics — faithfulness, claim support, factual consistency, contradiction detection. Supports LLM augmentation.
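To make the idea concrete, here is a deliberately naive word-overlap version of claim support: the fraction of output sentences whose words sufficiently overlap with the context. Real hallucination metrics use much stronger methods (NLI models, LLM judges); this sketch only illustrates the shape of the computation, and all names and the threshold are assumptions.

```python
import re

def claim_support(output: str, context: str, threshold: float = 0.5) -> float:
    """Fraction of output sentences with word overlap >= threshold vs. context."""
    ctx_words = set(re.findall(r"\w+", context.lower()))
    sentences = [s for s in re.split(r"[.!?]+", output) if s.strip()]
    if not sentences:
        return 1.0
    supported = 0
    for s in sentences:
        words = set(re.findall(r"\w+", s.lower()))
        if words and len(words & ctx_words) / len(words) >= threshold:
            supported += 1
    return supported / len(sentences)
```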
RAG
19 metrics — context recall, precision, answer relevancy, groundedness, multi-hop reasoning, and composite RAG scores.
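Two of the retrieval metrics in this category have standard textbook definitions, sketched below for binary relevance labels. Precision@k is the share of the top-k retrieved documents that are relevant; recall@k is the share of all relevant documents found in the top-k. Argument names are illustrative.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for d in top_k if d in relevant) / len(top_k)

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top-k."""
    if not relevant:
        return 0.0
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)
```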
Agents & Function Calling
11 metrics — task completion, tool selection, trajectory scoring, function call validation, and reasoning quality.
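A minimal sketch of function-call matching, assuming calls are represented as dicts with `name` and `arguments` keys (an assumption about the data shape, not the library's actual representation): a loose check compares only the function name, while an exact-match check also requires identical arguments.

```python
def function_name_match(call: dict, expected: dict) -> bool:
    """True if the called function's name matches the expected one."""
    return call.get("name") == expected.get("name")

def function_call_exact_match(call: dict, expected: dict) -> bool:
    """True if both the function name and its arguments match exactly."""
    return (function_name_match(call, expected)
            and call.get("arguments", {}) == expected.get("arguments", {}))
```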
Guardrails
4+ scanners — prompt injection, PII detection, secret detection, SQL injection. All run in under 10ms.
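An illustrative pattern-based scanner in the spirit of `secret_detection`. Production scanners combine many more patterns with entropy checks; the small pattern list below is an assumption for demonstration only (the `AKIA` prefix is the well-known AWS access key ID format).

```python
import re

# Illustrative subset of secret patterns; real scanners use many more.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                     # AWS access key ID
    re.compile(r"(?i)api[_-]?key\s*[:=]\s*\S{8,}"),      # generic API key assignment
    re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),  # PEM private key
]

def detect_secrets(text: str) -> bool:
    """Return True if any known secret pattern appears in the text."""
    return any(p.search(text) for p in SECRET_PATTERNS)
```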
All Metrics (A-Z)
Quick lookup — find any metric by name.
| Metric | Category | What it checks |
|---|---|---|
| action_safety | Agents | Whether agent actions are safe |
| answer_relevancy | RAG | How relevant the answer is to the query |
| bleu_score | Similarity | BLEU score between output and reference |
| citation_presence | RAG | Whether sources are cited in the response |
| claim_support | Hallucination | Whether claims are supported by context |
| contains | String | Output contains a keyword |
| contains_all | String | Output contains all specified keywords |
| contains_any | String | Output contains at least one keyword |
| contains_email | String | Output contains an email address |
| contains_json | JSON | Output contains valid JSON |
| contains_link | String | Output contains a URL |
| contains_none | String | Output contains none of the forbidden keywords |
| contains_valid_link | String | Output contains a reachable URL |
| context_entity_recall | RAG | How many entities from context appear in the answer |
| context_precision | RAG | Precision of retrieved context |
| context_recall | RAG | How much relevant context was retrieved |
| context_relevance_to_response | RAG | How relevant context is to the generated response |
| context_utilization | RAG | How much of the context was actually used |
| contradiction_detection | Hallucination | Whether output contradicts the context |
| ends_with | String | Output ends with a specific string |
| equals | String | Output exactly matches expected |
| factual_consistency | Hallucination | Whether output is factually consistent with context |
| faithfulness | Hallucination | Whether output is faithful to the provided context |
| field_completeness | Structured | Whether all expected fields are present |
| field_coverage | Structured | Percentage of expected fields that are filled |
| function_call_accuracy | Agents | Whether function calls are correct |
| function_call_exact_match | Agents | Exact match of function call with expected |
| function_name_match | Agents | Whether the correct function was called |
| goal_progress | Agents | How much progress was made toward the goal |
| groundedness | RAG | Whether the response is grounded in context |
| hallucination_score | Hallucination | Overall hallucination score |
| hierarchy_score | Structured | How well nested structure matches expected |
| is_email | String | Output is a valid email address |
| is_json | JSON | Output is valid JSON |
| json_schema | JSON | Output matches a JSON schema |
| json_syntax | JSON | Output has correct JSON syntax |
| json_validation | JSON | Output passes JSON validation rules |
| length_between | String | Output length is within a range |
| length_greater_than | String | Output exceeds a minimum length |
| length_less_than | String | Output is under a maximum length |
| levenshtein_similarity | Similarity | Edit distance similarity between texts |
| mrr | RAG | Mean Reciprocal Rank of retrieved results |
| multi_hop_reasoning | RAG | Whether multi-step reasoning is correct |
| ndcg | RAG | Normalized Discounted Cumulative Gain |
| noise_sensitivity | RAG | How sensitive retrieval is to noisy input |
| numeric_similarity | Similarity | Similarity between numeric values |
| one_line | String | Output is a single line |
| parameter_validation | Agents | Whether function parameters are correct |
| pii_detection | Guardrails | Detects personally identifiable information |
| precision_at_k | RAG | Precision at rank K |
| prompt_injection | Guardrails | Detects prompt injection attempts |
| quick_structured_check | Structured | Fast basic structure validation |
| rag_faithfulness | RAG | Faithfulness specific to RAG pipelines |
| rag_faithfulness_with_reference | RAG | RAG faithfulness with reference answer |
| rag_score | RAG | Composite RAG quality score |
| rag_score_detailed | RAG | Composite RAG score with per-metric breakdown |
| reasoning_quality | Agents | Quality of the agent's reasoning chain |
| recall_at_k | RAG | Recall at rank K |
| recall_score | Similarity | Recall between output and reference |
| regex | String | Output matches a regex pattern |
| required_fields | Structured | Whether required fields are present |
| rouge_score | Similarity | ROUGE score between output and reference |
| schema_compliance | Structured | Whether output matches a schema |
| secret_detection | Guardrails | Detects API keys, passwords, tokens |
| source_attribution | RAG | Whether sources are properly attributed |
| sql_injection | Guardrails | Detects SQL injection attempts |
| starts_with | String | Output starts with a specific string |
| step_efficiency | Agents | Whether the agent used minimal steps |
| structured_output_score | Structured | Overall structured output quality |
| task_completion | Agents | Whether the agent completed the task |
| tool_selection_accuracy | Agents | Whether the right tools were selected |
| trajectory_score | Agents | Overall agent trajectory quality |
| tree_edit_distance | Structured | Edit distance between output and expected structure |
| type_compliance | Structured | Whether field types match expected types |