Metrics Reference
Browse all 76+ local evaluation metrics by category. String checks, JSON validation, similarity, hallucination, RAG, agents, structured output, and guardrails.
- 76+ local metrics, all run in under 1ms with no API key
- `from fi.evals import evaluate`, then pass any metric name as a string
- Metrics are grouped by category below — click through for full docs and examples
All local metrics run via the same evaluate() function. Pass the metric name as a string and provide the required inputs as keyword arguments.
from fi.evals import evaluate
result = evaluate("contains", output="Hello world", keyword="Hello")
print(result.score) # 1.0
print(result.passed) # True
print(result.reason) # "Keyword 'Hello' found"
Categories
String & Similarity
23 metrics — keyword matching, regex, length checks, BLEU, ROUGE, Levenshtein, and embedding similarity.
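As an illustration of what this category measures, here is a minimal pure-Python sketch of an edit-distance similarity in the spirit of `levenshtein_similarity`. The function names and the normalization (distance divided by the longer string's length) are assumptions for illustration, not the library's actual internals.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def levenshtein_similarity(a: str, b: str) -> float:
    """Normalize edit distance into a 0..1 similarity score."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))
```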
JSON & Structured Output
14 metrics — JSON validation, schema compliance, field completeness, type checking, and hierarchy scoring.
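A rough sketch of the kind of check a field-completeness metric performs: parse the output as JSON and score the fraction of expected fields that are present and non-empty. The function signature and the definition of "empty" here are illustrative assumptions.

```python
import json

def field_completeness(output: str, expected_fields: list[str]) -> float:
    """Fraction of expected fields present and non-empty in the output JSON."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(data, dict):
        return 0.0
    if not expected_fields:
        return 1.0
    filled = sum(1 for f in expected_fields
                 if f in data and data[f] not in (None, "", [], {}))
    return filled / len(expected_fields)
```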
Hallucination
5 metrics — faithfulness, claim support, factual consistency, contradiction detection. Supports LLM augmentation.
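To make the idea concrete, here is a deliberately naive word-overlap version of claim support: the fraction of output sentences whose words sufficiently overlap with the context. Real hallucination metrics use much stronger methods (NLI models, LLM judges); this sketch only illustrates the shape of the computation, and all names and the threshold are assumptions.

```python
import re

def claim_support(output: str, context: str, threshold: float = 0.5) -> float:
    """Fraction of output sentences with word overlap >= threshold vs. context."""
    ctx_words = set(re.findall(r"\w+", context.lower()))
    sentences = [s for s in re.split(r"[.!?]+", output) if s.strip()]
    if not sentences:
        return 1.0
    supported = 0
    for s in sentences:
        words = set(re.findall(r"\w+", s.lower()))
        if words and len(words & ctx_words) / len(words) >= threshold:
            supported += 1
    return supported / len(sentences)
```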
RAG
19 metrics — context recall, precision, answer relevancy, groundedness, multi-hop reasoning, and composite RAG scores.
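Two of the retrieval metrics in this category have standard textbook definitions, sketched below for binary relevance labels. Precision@k is the share of the top-k retrieved documents that are relevant; recall@k is the share of all relevant documents found in the top-k. Argument names are illustrative.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for d in top_k if d in relevant) / len(top_k)

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top-k."""
    if not relevant:
        return 0.0
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)
```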
Agents & Function Calling
11 metrics — task completion, tool selection, trajectory scoring, function call validation, and reasoning quality.
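A minimal sketch of function-call matching, assuming calls are represented as dicts with `name` and `arguments` keys (an assumption about the data shape, not the library's actual representation): a loose check compares only the function name, while an exact-match check also requires identical arguments.

```python
def function_name_match(call: dict, expected: dict) -> bool:
    """True if the called function's name matches the expected one."""
    return call.get("name") == expected.get("name")

def function_call_exact_match(call: dict, expected: dict) -> bool:
    """True if both the function name and its arguments match exactly."""
    return (function_name_match(call, expected)
            and call.get("arguments", {}) == expected.get("arguments", {}))
```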
Guardrails
4+ scanners — prompt injection, PII detection, secret detection, SQL injection. All run in under 10ms.
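An illustrative pattern-based scanner in the spirit of `secret_detection`. Production scanners combine many more patterns with entropy checks; the small pattern list below is an assumption for demonstration only (the `AKIA` prefix is the well-known AWS access key ID format).

```python
import re

# Illustrative subset of secret patterns; real scanners use many more.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                     # AWS access key ID
    re.compile(r"(?i)api[_-]?key\s*[:=]\s*\S{8,}"),      # generic API key assignment
    re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),  # PEM private key
]

def detect_secrets(text: str) -> bool:
    """Return True if any known secret pattern appears in the text."""
    return any(p.search(text) for p in SECRET_PATTERNS)
```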
All Metrics (A-Z)
Quick lookup — find any metric by name.
| Metric | Category | What it checks |
|---|---|---|
| action_safety | Agents | Whether agent actions are safe |
| answer_relevancy | RAG | How relevant the answer is to the query |
| bleu_score | Similarity | BLEU score between output and reference |
| citation_presence | RAG | Whether sources are cited in the response |
| claim_support | Hallucination | Whether claims are supported by context |
| contains | String | Output contains a keyword |
| contains_all | String | Output contains all specified keywords |
| contains_any | String | Output contains at least one keyword |
| contains_email | String | Output contains an email address |
| contains_json | JSON | Output contains valid JSON |
| contains_link | String | Output contains a URL |
| contains_none | String | Output contains none of the forbidden keywords |
| contains_valid_link | String | Output contains a reachable URL |
| context_entity_recall | RAG | How many entities from context appear in the answer |
| context_precision | RAG | Precision of retrieved context |
| context_recall | RAG | How much relevant context was retrieved |
| context_relevance_to_response | RAG | How relevant context is to the generated response |
| context_utilization | RAG | How much of the context was actually used |
| contradiction_detection | Hallucination | Whether output contradicts the context |
| ends_with | String | Output ends with a specific string |
| equals | String | Output exactly matches expected |
| factual_consistency | Hallucination | Whether output is factually consistent with context |
| faithfulness | Hallucination | Whether output is faithful to the provided context |
| field_completeness | Structured | Whether all expected fields are present |
| field_coverage | Structured | Percentage of expected fields that are filled |
| function_call_accuracy | Agents | Whether function calls are correct |
| function_call_exact_match | Agents | Exact match of function call with expected |
| function_name_match | Agents | Whether the correct function was called |
| goal_progress | Agents | How much progress was made toward the goal |
| groundedness | RAG | Whether the response is grounded in context |
| hallucination_score | Hallucination | Overall hallucination score |
| hierarchy_score | Structured | How well nested structure matches expected |
| is_email | String | Output is a valid email address |
| is_json | JSON | Output is valid JSON |
| json_schema | JSON | Output matches a JSON schema |
| json_syntax | JSON | Output has correct JSON syntax |
| json_validation | JSON | Output passes JSON validation rules |
| length_between | String | Output length is within a range |
| length_greater_than | String | Output exceeds a minimum length |
| length_less_than | String | Output is under a maximum length |
| levenshtein_similarity | Similarity | Edit distance similarity between texts |
| mrr | RAG | Mean Reciprocal Rank of retrieved results |
| multi_hop_reasoning | RAG | Whether multi-step reasoning is correct |
| ndcg | RAG | Normalized Discounted Cumulative Gain |
| noise_sensitivity | RAG | How sensitive retrieval is to noisy input |
| numeric_similarity | Similarity | Similarity between numeric values |
| one_line | String | Output is a single line |
| parameter_validation | Agents | Whether function parameters are correct |
| pii_detection | Guardrails | Detects personally identifiable information |
| precision_at_k | RAG | Precision at rank K |
| prompt_injection | Guardrails | Detects prompt injection attempts |
| quick_structured_check | Structured | Fast basic structure validation |
| rag_faithfulness | RAG | Faithfulness specific to RAG pipelines |
| rag_faithfulness_with_reference | RAG | RAG faithfulness with reference answer |
| rag_score | RAG | Composite RAG quality score |
| rag_score_detailed | RAG | Composite RAG score with per-metric breakdown |
| reasoning_quality | Agents | Quality of the agent's reasoning chain |
| recall_at_k | RAG | Recall at rank K |
| recall_score | Similarity | Recall between output and reference |
| regex | String | Output matches a regex pattern |
| required_fields | Structured | Whether required fields are present |
| rouge_score | Similarity | ROUGE score between output and reference |
| schema_compliance | Structured | Whether output matches a schema |
| secret_detection | Guardrails | Detects API keys, passwords, tokens |
| source_attribution | RAG | Whether sources are properly attributed |
| sql_injection | Guardrails | Detects SQL injection attempts |
| starts_with | String | Output starts with a specific string |
| step_efficiency | Agents | Whether the agent used minimal steps |
| structured_output_score | Structured | Overall structured output quality |
| task_completion | Agents | Whether the agent completed the task |
| tool_selection_accuracy | Agents | Whether the right tools were selected |
| trajectory_score | Agents | Overall agent trajectory quality |
| tree_edit_distance | Structured | Edit distance between output and expected structure |
| type_compliance | Structured | Whether field types match expected types |