Local & Hybrid Evaluation

Run evaluations locally with zero API calls. Auto-route between local and cloud metrics. Use Ollama for offline LLM-based scoring.

📝 TL;DR
  • LocalEvaluator runs 26+ metrics locally - zero latency, zero cost, no API key needed
  • HybridEvaluator auto-routes: local metrics stay local, cloud metrics go to Turing
  • OllamaLLM runs LLM-based metrics (coherence, relevance, etc.) entirely offline

Not every evaluation needs a round-trip to the cloud. String checks, JSON validation, BLEU scores, embedding similarity - these run locally in under 1ms. The local module gives you a LocalEvaluator for pure-local execution, and a HybridEvaluator that automatically routes each metric to the right engine.

Note

Requires pip install ai-evaluation. For offline LLM-based metrics, you also need Ollama running locally.

Quick Example

from fi.evals.local import LocalEvaluator

evaluator = LocalEvaluator()

# Zero API calls, sub-millisecond
result = evaluator.evaluate("is_json", [{"response": '{"status": "ok"}'}])
print(result.results.eval_results[0].output)  # 1.0
print(result.executed_locally)                 # {"is_json"}
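Many of these local checks are thin wrappers over stdlib calls. As an illustration only (not the library's source), an `is_json`-style check reduces to a `json.loads` round-trip:

```python
import json

def is_json(response: str) -> float:
    # Illustrative sketch: 1.0 if the string parses as JSON, else 0.0.
    try:
        json.loads(response)
        return 1.0
    except (ValueError, TypeError):
        return 0.0
```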

LocalEvaluator

Runs metrics that don’t need any external service.

from fi.evals.local import LocalEvaluator, LocalEvaluatorConfig, RoutingMode

evaluator = LocalEvaluator(
    config=LocalEvaluatorConfig(
        execution_mode=RoutingMode.LOCAL,  # LOCAL, CLOUD, or HYBRID
        fail_on_unsupported=False,         # skip unsupported metrics instead of erroring
        parallel_workers=4,                # concurrent evaluations
        timeout=60,                        # seconds per evaluation
    )
)

Single metric

result = evaluator.evaluate(
    "bleu_score",
    [{"response": "the cat sat", "expected_response": "the cat sat on the mat"}],
)

for r in result.results.eval_results:
    print(f"{r.name}: {r.output:.3f}")  # bleu_score: 0.207
print(f"Ran locally: {result.executed_locally}")  # {"bleu_score"}

With config

Some metrics need configuration:

result = evaluator.evaluate(
    "contains",
    [{"response": "The API returned a 200 OK status"}],
    config={"keyword": "200 OK"},
)
# contains: 1.0

Batch evaluation

Run multiple metrics in one call:

result = evaluator.evaluate_batch([
    {"metric_name": "is_json", "inputs": [{"response": '{"valid": true}'}]},
    {"metric_name": "one_line", "inputs": [{"response": "single line output"}]},
    {"metric_name": "contains", "inputs": [{"response": "hello world"}], "config": {"keyword": "hello"}},
    {"metric_name": "bleu_score", "inputs": [{"response": "the cat", "expected_response": "the cat sat"}]},
])

for r in result.results.eval_results:
    print(f"{r.name}: {r.output}")
print(f"All local: {result.executed_locally}")  # {"is_json", "one_line", "contains", "bleu_score"}

Check what runs locally

from fi.evals.local import can_run_locally, LOCAL_CAPABLE_METRICS

# Check a specific metric
print(can_run_locally("bleu_score"))   # True
print(can_run_locally("toxicity"))     # False - needs cloud

# See all local-capable metrics
print(LOCAL_CAPABLE_METRICS)
# {"bleu_score", "contains", "contains_all", "contains_any", "contains_email",
#  "contains_json", "contains_link", "contains_none", "contains_valid_link",
#  "embedding_similarity", "ends_with", "equals", "is_email", "is_json",
#  "json_schema", "length_between", "length_greater_than", "length_less_than",
#  "levenshtein_similarity", "numeric_similarity", "one_line", "recall_score",
#  "regex", "rouge_score", "semantic_list_contains", "starts_with"}

# List all available metrics (includes registry-registered beyond LOCAL_CAPABLE_METRICS)
evaluator = LocalEvaluator()
print(len(evaluator.list_available_metrics()))  # 72

Note

LOCAL_CAPABLE_METRICS is the guaranteed-local set (26 string/JSON/similarity metrics). The registry has 72+ metrics total - including RAG, agents, structured output, and hallucination metrics that also run locally through the registry but aren’t in the LOCAL_CAPABLE_METRICS heuristic set. Use list_available_metrics() to see everything the LocalEvaluator can run.
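For a sense of what the similarity metrics compute, here is an illustrative (not the library's) `levenshtein_similarity` - edit distance normalized to a 0-1 score:

```python
def levenshtein_similarity(a: str, b: str) -> float:
    # Illustrative sketch: 1.0 for identical strings, 0.0 for maximally different.
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if chars match)
            ))
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))
```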

HybridEvaluator

Auto-routes metrics to the best execution engine. Local metrics run locally, cloud metrics go to Turing, and LLM-based metrics can optionally run through Ollama.

from fi.evals.local import HybridEvaluator

evaluator = HybridEvaluator(
    prefer_local=True,         # prefer local execution when possible
    fallback_to_cloud=True,    # fall back to cloud if local fails
    offline_mode=False,        # True = block all cloud calls
)

Auto-routing

from fi.evals.local import HybridEvaluator, RoutingMode

evaluator = HybridEvaluator()

# Check where a metric will run
print(evaluator.route_evaluation("is_json"))        # RoutingMode.LOCAL
print(evaluator.route_evaluation("toxicity"))       # RoutingMode.CLOUD
print(evaluator.route_evaluation("faithfulness"))   # RoutingMode.CLOUD

# Force routing
print(evaluator.route_evaluation("is_json", force_cloud=True))  # RoutingMode.CLOUD

Partition evaluations

Split a batch into local vs cloud groups:

evaluator = HybridEvaluator()

evaluations = [
    {"metric_name": "is_json", "inputs": [{"response": "{}"}]},
    {"metric_name": "toxicity", "inputs": [{"response": "hello"}]},
    {"metric_name": "bleu_score", "inputs": [{"response": "test", "expected_response": "test"}]},
]

partitioned = evaluator.partition_evaluations(evaluations)
for mode, evals in partitioned.items():
    print(f"{mode.value}: {[e['metric_name'] for e in evals]}")
# local: ["is_json", "bleu_score"]
# cloud: ["toxicity"]
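Conceptually, partitioning is a group-by over the per-metric routing decision. A simplified stand-alone sketch (abridged capability set; not the library source):

```python
from collections import defaultdict

LOCAL_CAPABLE = {"is_json", "bleu_score", "contains"}  # abridged for illustration

def partition(evaluations):
    # Group each evaluation spec by where its metric can run.
    groups = defaultdict(list)
    for ev in evaluations:
        mode = "local" if ev["metric_name"] in LOCAL_CAPABLE else "cloud"
        groups[mode].append(ev)
    return dict(groups)
```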

Evaluate

# Runs locally if possible, falls back to cloud
result = evaluator.evaluate("is_json", [{"response": '{"key": "value"}'}])
print(result.results.eval_results[0].output)  # 1.0

Note

HybridEvaluator.evaluate() takes the metric name as its first positional argument (parameter is named template internally). Always pass it positionally - evaluate("is_json", ...) - not as a keyword argument.

Offline mode

Block all cloud calls - useful for air-gapped environments or CI pipelines without API keys:

evaluator = HybridEvaluator(offline_mode=True)

# Local metrics work fine
result = evaluator.evaluate("is_json", [{"response": "{}"}])  # works

# Cloud metrics raise ValueError
try:
    evaluator.route_evaluation("toxicity")
except ValueError as e:
    print(e)  # "Metric 'toxicity' requires cloud execution but offline_mode is enabled"

Ollama Integration

Run LLM-based metrics locally using Ollama. No API keys, no cloud calls - everything stays on your machine.

Setup

from fi.evals.local import OllamaLLM, LocalLLMConfig

# Default config - connects to localhost:11434, uses llama3.2
llm = OllamaLLM()

# Custom config
llm = OllamaLLM(config=LocalLLMConfig(
    model="llama3.2:3b",
    base_url="http://localhost:11434",
    temperature=0.0,
    max_tokens=1024,
    timeout=120,
))

# Check availability
print(llm.is_available())   # True if Ollama is running
print(llm.list_models())    # ["llama3.2:3b", "llama-guard3:1b", ...]

Using with HybridEvaluator

from fi.evals.local import HybridEvaluator, OllamaLLM

llm = OllamaLLM()
evaluator = HybridEvaluator(local_llm=llm, offline_mode=True)

# These LLM-based metrics now run locally via Ollama
# instead of being routed to cloud
print(evaluator.can_use_local_llm("coherence"))      # True
print(evaluator.can_use_local_llm("relevance"))      # True
print(evaluator.can_use_local_llm("groundedness"))   # True
print(evaluator.can_use_local_llm("hallucination"))  # True
print(evaluator.can_use_local_llm("safety"))         # True
print(evaluator.can_use_local_llm("tone"))           # True
print(evaluator.can_use_local_llm("bias"))           # True

Tip

LLM-based metrics that can run through Ollama: coherence, relevance, answer_relevance, context_relevance, groundedness, hallucination, safety, tone, bias, pii, custom_llm_judge.
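Under the hood, an LLM-based metric is essentially a judge prompt plus JSON parsing of the model's verdict. A minimal sketch with a stubbed model call (the prompt wording and stub are hypothetical, not the library's):

```python
import json

def stub_llm(prompt: str) -> str:
    # Stand-in for an Ollama generate() call; always returns a fixed verdict.
    return '{"score": 1.0, "reason": "Answer is correct"}'

def judge(query: str, response: str, criteria: str, llm=stub_llm):
    # Build a judge prompt, ask the model for a JSON verdict, parse it.
    prompt = (
        'You are an evaluator. Reply with JSON: {"score": 0-1, "reason": "..."}.\n'
        f"Criteria: {criteria}\nQuery: {query}\nResponse: {response}"
    )
    verdict = json.loads(llm(prompt))
    return verdict["score"], verdict["reason"]
```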

Direct LLM usage

Use the Ollama wrapper directly for custom scoring logic:

from fi.evals.local import OllamaLLM

llm = OllamaLLM()

# Judge a response
result = llm.judge(
    query="What is 2+2?",
    response="4",
    criteria="Is the answer mathematically correct?",
    output_format="json",
)
print(result)  # {"score": 1.0, "reason": "The answer is correct"}

# Batch judge
results = llm.batch_judge([
    {"query": "Capital of France?", "response": "Paris", "criteria": "Is this correct?"},
    {"query": "2+2?", "response": "5", "criteria": "Is this correct?"},
])

# General generation
response = llm.generate("Explain SQL injection in one sentence")
print(response)

# Chat
response = llm.chat([
    {"role": "system", "content": "You are a security expert."},
    {"role": "user", "content": "Is using unvalidated input in queries safe?"},
])

Factory

Create LLM instances programmatically:

from fi.evals.local import LocalLLMFactory, LocalLLMConfig

# By backend name
llm = LocalLLMFactory.create(backend="ollama", config=LocalLLMConfig(model="llama3.2:3b"))

# From a spec string (format: "backend/model")
llm = LocalLLMFactory.from_string("ollama/llama3.2")

Metric Registry

The registry manages all locally-available metrics. Use it to discover metrics or register custom ones.

from fi.evals.local import get_registry

registry = get_registry()

# List all registered metrics
metrics = registry.list_metrics()
print(len(metrics))  # 72

# Check if a metric is registered
print(registry.is_registered("bleu_score"))  # True

# Get a metric class (use registry.create() for an instance)
metric_cls = registry.get("bleu_score")

Registering custom metrics

from fi.evals.local import get_registry
from fi.evals.metrics.base_metric import BaseMetric

class MyCustomMetric(BaseMetric):
    def compute(self, inputs):
        response = inputs.get("response", "")
        score = 1.0 if len(response) > 50 else 0.0
        return {"score": score, "reason": f"Length: {len(response)}"}

registry = get_registry()
registry.register("my_custom", MyCustomMetric)

# Now use it with LocalEvaluator
from fi.evals.local import LocalEvaluator
evaluator = LocalEvaluator()
result = evaluator.evaluate("my_custom", [{"response": "A sufficiently long response for testing purposes here"}])

Lazy registration

For metrics with heavy imports:

registry.register_lazy("heavy_metric", lambda: HeavyMetricClass)  # import deferred until first use

Routing Logic

from fi.evals.local import select_routing_mode, RoutingMode

# Auto-select based on capability
mode = select_routing_mode("is_json", RoutingMode.HYBRID)       # LOCAL
mode = select_routing_mode("toxicity", RoutingMode.HYBRID)      # CLOUD
mode = select_routing_mode("is_json", RoutingMode.CLOUD)        # CLOUD - preferred_mode overrides

# Force overrides
mode = select_routing_mode("is_json", RoutingMode.HYBRID, force_local=True)   # LOCAL
mode = select_routing_mode("toxicity", RoutingMode.HYBRID, force_cloud=True)  # CLOUD

Warning

force_local=True raises ValueError if the metric isn’t in LOCAL_CAPABLE_METRICS. Only use it with metrics you know can run locally.
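The precedence described above - force flags first, then an explicit preferred mode, then capability - can be sketched as follows (illustrative only, not the library source; `Mode` stands in for `RoutingMode`):

```python
from enum import Enum

class Mode(Enum):  # stand-in for the library's RoutingMode
    LOCAL = "local"
    CLOUD = "cloud"
    HYBRID = "hybrid"

LOCAL_CAPABLE = {"is_json", "bleu_score", "contains"}  # abridged

def select_mode(metric, preferred, force_local=False, force_cloud=False):
    if force_local:
        if metric not in LOCAL_CAPABLE:
            raise ValueError(f"Metric '{metric}' cannot run locally")
        return Mode.LOCAL
    if force_cloud:
        return Mode.CLOUD
    if preferred is not Mode.HYBRID:
        return preferred  # an explicit LOCAL/CLOUD preference overrides capability
    return Mode.LOCAL if metric in LOCAL_CAPABLE else Mode.CLOUD
```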

Result Types

LocalEvaluationResult

| Field | Type | Description |
| --- | --- | --- |
| results | BatchRunResult | Evaluation results (same format as cloud) |
| executed_locally | set[str] | Metric names that ran locally |
| skipped | set[str] | Metrics that were skipped |
| errors | dict[str, str] | Metric name to error message |

result = evaluator.evaluate_batch([...])

# Check what ran where
print(result.executed_locally)  # {"is_json", "bleu_score"}
print(result.skipped)           # {"toxicity"}  (if fail_on_unsupported=False)
print(result.errors)            # {"contains": "requires 'keyword' config"}

# Access individual results
for r in result.results.eval_results:
    print(f"{r.name}: score={r.output}, reason={r.reason}")

When to Use What

| Scenario | Use |
| --- | --- |
| CI pipeline, no API keys | LocalEvaluator or HybridEvaluator(offline_mode=True) |
| Air-gapped environment | HybridEvaluator + OllamaLLM |
| Development/testing | LocalEvaluator for fast iteration |
| Production with cost control | HybridEvaluator(prefer_local=True) |
| Need toxicity/faithfulness | HybridEvaluator (routes to cloud automatically) |
| Need LLM scoring offline | HybridEvaluator + OllamaLLM |