Metrics Reference

Browse all 76+ local evaluation metrics by category. String checks, JSON validation, similarity, hallucination, RAG, agents, structured output, and guardrails.

📝
TL;DR
  • 76+ local metrics, each running in under 1ms with no API key
  • `from fi.evals import evaluate`, then pass any metric name as a string
  • Metrics are grouped by category below — click through for full docs and examples

All local metrics run via the same `evaluate()` function. Pass the metric name as a string and provide the required inputs as keyword arguments.

```python
from fi.evals import evaluate

result = evaluate("contains", output="Hello world", keyword="Hello")
print(result.score)   # 1.0
print(result.passed)  # True
print(result.reason)  # "Keyword 'Hello' found"
```
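The binary string checks are easy to reason about: each one maps a simple predicate to a 0/1 score. As a mental model only (an illustrative sketch, not the library's actual implementation), `contains`, `contains_all`, and `contains_any` behave like:

```python
# Illustrative re-implementations of the binary string checks.
# These mirror the documented semantics; they are not fi.evals source code.

def contains_score(output: str, keyword: str) -> float:
    """1.0 if the keyword occurs anywhere in the output, else 0.0."""
    return 1.0 if keyword in output else 0.0

def contains_all_score(output: str, keywords: list[str]) -> float:
    """1.0 only if every keyword occurs in the output."""
    return 1.0 if all(k in output for k in keywords) else 0.0

def contains_any_score(output: str, keywords: list[str]) -> float:
    """1.0 if at least one keyword occurs in the output."""
    return 1.0 if any(k in output for k in keywords) else 0.0

print(contains_score("Hello world", "Hello"))              # 1.0
print(contains_all_score("Hello world", ["Hello", "hi"]))  # 0.0
print(contains_any_score("Hello world", ["Hello", "hi"]))  # 1.0
```

This is also why these checks run so fast: they are plain substring tests with no model call involved.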

All Metrics (A-Z)

Quick lookup — find any metric by name.

| Metric | Category | What it checks |
| --- | --- | --- |
| `action_safety` | Agents | Whether agent actions are safe |
| `answer_relevancy` | RAG | How relevant the answer is to the query |
| `bleu_score` | Similarity | BLEU score between output and reference |
| `citation_presence` | RAG | Whether sources are cited in the response |
| `claim_support` | Hallucination | Whether claims are supported by context |
| `contains` | String | Output contains a keyword |
| `contains_all` | String | Output contains all specified keywords |
| `contains_any` | String | Output contains at least one keyword |
| `contains_email` | String | Output contains an email address |
| `contains_json` | JSON | Output contains valid JSON |
| `contains_link` | String | Output contains a URL |
| `contains_none` | String | Output contains none of the forbidden keywords |
| `contains_valid_link` | String | Output contains a reachable URL |
| `context_entity_recall` | RAG | How many entities from context appear in the answer |
| `context_precision` | RAG | Precision of retrieved context |
| `context_recall` | RAG | How much relevant context was retrieved |
| `context_relevance_to_response` | RAG | How relevant context is to the generated response |
| `context_utilization` | RAG | How much of the context was actually used |
| `contradiction_detection` | Hallucination | Whether output contradicts the context |
| `ends_with` | String | Output ends with a specific string |
| `equals` | String | Output exactly matches expected |
| `factual_consistency` | Hallucination | Whether output is factually consistent with context |
| `faithfulness` | Hallucination | Whether output is faithful to the provided context |
| `field_completeness` | Structured | Whether all expected fields are present |
| `field_coverage` | Structured | Percentage of expected fields that are filled |
| `function_call_accuracy` | Agents | Whether function calls are correct |
| `function_call_exact_match` | Agents | Exact match of function call with expected |
| `function_name_match` | Agents | Whether the correct function was called |
| `goal_progress` | Agents | How much progress was made toward the goal |
| `groundedness` | RAG | Whether the response is grounded in context |
| `hallucination_score` | Hallucination | Overall hallucination score |
| `hierarchy_score` | Structured | How well nested structure matches expected |
| `is_email` | String | Output is a valid email address |
| `is_json` | JSON | Output is valid JSON |
| `json_schema` | JSON | Output matches a JSON schema |
| `json_syntax` | JSON | Output has correct JSON syntax |
| `json_validation` | JSON | Output passes JSON validation rules |
| `length_between` | String | Output length is within a range |
| `length_greater_than` | String | Output exceeds a minimum length |
| `length_less_than` | String | Output is under a maximum length |
| `levenshtein_similarity` | Similarity | Edit distance similarity between texts |
| `mrr` | RAG | Mean Reciprocal Rank of retrieved results |
| `multi_hop_reasoning` | RAG | Whether multi-step reasoning is correct |
| `ndcg` | RAG | Normalized Discounted Cumulative Gain |
| `noise_sensitivity` | RAG | How sensitive retrieval is to noisy input |
| `numeric_similarity` | Similarity | Similarity between numeric values |
| `one_line` | String | Output is a single line |
| `parameter_validation` | Agents | Whether function parameters are correct |
| `pii_detection` | Guardrails | Detects personally identifiable information |
| `precision_at_k` | RAG | Precision at rank K |
| `prompt_injection` | Guardrails | Detects prompt injection attempts |
| `quick_structured_check` | Structured | Fast basic structure validation |
| `rag_faithfulness` | RAG | Faithfulness specific to RAG pipelines |
| `rag_faithfulness_with_reference` | RAG | RAG faithfulness with reference answer |
| `rag_score` | RAG | Composite RAG quality score |
| `rag_score_detailed` | RAG | Composite RAG score with per-metric breakdown |
| `reasoning_quality` | Agents | Quality of the agent's reasoning chain |
| `recall_at_k` | RAG | Recall at rank K |
| `recall_score` | Similarity | Recall between output and reference |
| `regex` | String | Output matches a regex pattern |
| `required_fields` | Structured | Whether required fields are present |
| `rouge_score` | Similarity | ROUGE score between output and reference |
| `schema_compliance` | Structured | Whether output matches a schema |
| `secret_detection` | Guardrails | Detects API keys, passwords, tokens |
| `source_attribution` | RAG | Whether sources are properly attributed |
| `sql_injection` | Guardrails | Detects SQL injection attempts |
| `starts_with` | String | Output starts with a specific string |
| `step_efficiency` | Agents | Whether the agent used minimal steps |
| `structured_output_score` | Structured | Overall structured output quality |
| `task_completion` | Agents | Whether the agent completed the task |
| `tool_selection_accuracy` | Agents | Whether the right tools were selected |
| `trajectory_score` | Agents | Overall agent trajectory quality |
| `tree_edit_distance` | Structured | Edit distance between output and expected structure |
| `type_compliance` | Structured | Whether field types match expected types |
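The rank-based retrieval metrics in the table (`precision_at_k`, `recall_at_k`, `mrr`) follow standard information-retrieval definitions. Here is a minimal sketch of those textbook formulas, assuming the library matches them; its exact input names and implementation may differ:

```python
# Textbook definitions of the rank-based retrieval metrics.
# Illustrative only; not fi.evals source code.
from typing import Sequence, Set

def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved items that are relevant."""
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / k

def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant items that appear in the top-k."""
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / len(relevant)

def mrr(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """Reciprocal rank of the first relevant item (0.0 if none is retrieved)."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

docs = ["d3", "d1", "d7", "d2"]   # ranked retrieval results
gold = {"d1", "d2"}               # relevant documents
print(precision_at_k(docs, gold, k=2))  # 0.5 -> 1 of the top 2 is relevant
print(recall_at_k(docs, gold, k=2))     # 0.5 -> 1 of 2 relevant docs found
print(mrr(docs, gold))                  # 0.5 -> first relevant doc at rank 2
```

Note that `mrr` as listed here scores a single query; Mean Reciprocal Rank over a dataset averages this value across queries.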