Cloud Evals
Run pre-built evaluation templates on Future AGI's Turing cloud models. 100+ templates covering safety, RAG, hallucination, conversation quality, and more.
- 100+ pre-built templates on Turing cloud models (
turing_flash,turing_small,turing_large) - Use
list_evaluations()to discover available templates and filter by tag - Templates are updated server-side — new ones appear without upgrading your SDK
When you need to check something subjective (is this response helpful? is the tone right? did the model hallucinate?), local heuristics aren’t enough. Cloud evals send your data to Future AGI’s Turing models for scoring. Templates are managed server-side, so new ones appear without a pip upgrade. For the full platform guide on evaluations, see Evaluation docs.
Note
Requires pip install ai-evaluation and FI_API_KEY + FI_SECRET_KEY set in your environment.
Quick Example
from fi.evals import evaluate
result = evaluate("toxicity", output="You're doing a great job!", model="turing_flash")
print(result.score) # 1.0
print(result.passed) # True
Discovering Templates
Use list_evaluations() to see what’s available and what inputs each template needs.
from fi.evals import Evaluator
evaluator = Evaluator()
templates = evaluator.list_evaluations()
print(f"Total templates: {len(templates)}")
# Total templates: 107
# Each template has:
t = templates[0]
print(t["name"]) # "toxicity"
print(t["description"]) # what it checks
print(t["evalTags"]) # categories like ["SAFETY", "TEXT"]
print(t["config"]["requiredKeys"]) # what inputs you need to pass
Filtering by tag
from fi.evals import Evaluator
evaluator = Evaluator()
templates = evaluator.list_evaluations()
# Get all safety templates
safety = [t for t in templates if "SAFETY" in t.get("evalTags", [])]
print(f"Safety templates: {len(safety)}")
for t in safety:
print(f" {t['name']}: {t['description'][:80]}")
# Get all RAG templates
rag = [t for t in templates if "RAG" in t.get("evalTags", [])]
print(f"RAG templates: {len(rag)}")
Available tags
| Tag | What it covers |
|---|---|
SAFETY | Toxicity, bias, PII, content moderation, prompt injection |
RAG | Context adherence, chunk attribution, faithfulness, retrieval metrics |
HALLUCINATION | Hallucination detection, factual accuracy, groundedness |
CONVERSATION | Coherence, resolution, customer agent behaviors |
CHAT | General chat quality metrics |
AUDIO | Transcription accuracy, audio quality, TTS/ASR |
IMAGE | Caption hallucination, image instruction adherence |
TEXT | General text quality (completeness, tone, helpfulness) |
FUNCTION | Deterministic checks (contains, regex, JSON, similarity) |
LLMS | LLM-specific checks (bias, completeness, attribution) |
Checking required inputs
Before calling a template, check what inputs it needs:
from fi.evals import Evaluator
evaluator = Evaluator()
templates = evaluator.list_evaluations()
# Find a specific template
toxicity = next(t for t in templates if t["name"] == "toxicity")
print(toxicity["config"]["requiredKeys"]) # what you need to pass
print(toxicity["config"].get("configParamsDesc", {})) # parameter descriptions
Turing Models
Three tiers:
| Model | Speed | Use for |
|---|---|---|
turing_flash | ~1-2s | Quick checks, high-volume scoring |
turing_small | ~2-3s | Balanced speed and accuracy |
turing_large | ~3-5s | Complex judgments, highest accuracy |
from fi.evals import evaluate
# Fast check
result = evaluate("toxicity", output="...", model="turing_flash")
# More accurate
result = evaluate("toxicity", output="...", model="turing_large")
Running Cloud Evals
With the evaluate() function
from fi.evals import evaluate
# Single template
result = evaluate("tone", output="Dear Sir, I hope this finds you well.", model="turing_flash")
print(result.score) # 1.0
print(result.passed) # True
print(result.reason) # detailed explanation from Turing
# Multiple inputs
result = evaluate(
"context_adherence",
output="Paris is the capital of France.",
context="France is a country in Western Europe. Its capital is Paris.",
model="turing_flash",
)
With the Evaluator class
The Evaluator class provides additional features for cloud evals: pipeline execution, async results, and batch processing.
from fi.evals import Evaluator
evaluator = Evaluator()
# Run a pipeline across a dataset
result = evaluator.evaluate_pipeline(
project_name="my-project",
version="v1",
eval_data=[
{"template": "toxicity", "output": "Hello world", "model_name": "turing_flash"},
{"template": "tone", "output": "Dear Sir...", "model_name": "turing_flash"},
],
)
# Get async results
result = evaluator.get_eval_result(eval_id="abc-123")
# Get pipeline results across versions
results = evaluator.get_pipeline_results(
project_name="my-project",
versions=["v1", "v2"],
)
Template Reference
Grouped by category. Run list_evaluations() for the latest — new templates are added without SDK updates.
Safety (18 templates)
| Template | Description | Inputs |
|---|---|---|
toxicity | Toxic or harmful language | output |
content_moderation | Content safety using moderation models | output |
content_safety_violation | Broad safety/usage policy violations | output |
pii | Personally identifiable information | input |
prompt_injection | Prompt injection attempts | input |
protect_flash | FutureAGI proprietary harm detection | input |
bias_detection | Gender, racial, cultural, ideological bias | output |
no_racial_bias | Absence of racial bias | output |
no_gender_bias | Absence of gender bias | output |
no_age_bias | Absence of age bias | output |
sexist | Sexist content and gender bias | output |
tone | Tone and sentiment analysis | output |
data_privacy_compliance | GDPR/HIPAA compliance | output |
is_compliant | Legal/regulatory compliance | output |
is_harmful_advice | Physically/legally harmful advice | output |
no_harmful_therapeutic_guidance | Harmful psychological/therapeutic advice | output |
clinically_inappropriate_tone | Medical tone appropriateness | output |
answer_refusal | Correct refusal on harmful queries | input, output |
RAG & Context (14 templates)
| Template | Description | Inputs |
|---|---|---|
context_adherence | Response stays within provided context | output, context |
context_relevance | Retrieved context relevance to query | context, input |
groundedness | Output grounded in context | output, input, context |
detect_hallucination | Fabricated facts not in context | input, output, context |
is_factually_consistent | Factual consistency with source | input, output, context |
factual_accuracy | Factual accuracy against context | input, output, context |
chunk_attribution | Correct chunk citation | context, output |
chunk_utilization | Effective use of context chunks | context, output |
completeness | Response completeness given context | input, output |
summary_quality | Summary captures main points | input, output |
is_good_summary | Clear, well-structured summary | input, output |
eval_ranking | Ranks context by criteria | input, context |
translation_accuracy | Translation quality | input, output |
caption_hallucination | Image caption inaccuracies | image, caption |
Conversation (14 templates)
| Template | Description | Inputs |
|---|---|---|
conversation_coherence | Logical flow and context maintenance | conversation |
conversation_resolution | Satisfactory conclusion reached | conversation |
customer_agent_query_handling | Correct query interpretation | conversation |
customer_agent_context_retention | Remembers earlier context | conversation |
customer_agent_conversation_quality | Overall conversation quality | conversation |
customer_agent_clarification_seeking | Seeks clarification when needed | conversation |
customer_agent_objection_handling | Handles objections effectively | conversation |
customer_agent_human_escalation | Escalates to human appropriately | conversation |
customer_agent_loop_detection | Detects repetitive loops | conversation |
customer_agent_interruption_handling | Waits for user to finish | conversation |
customer_agent_language_handling | Correct language/dialect handling | conversation |
customer_agent_termination_handling | No crashes or abrupt cut-offs | conversation |
customer_agent_prompt_conformance | Adheres to system prompt | system_prompt, conversation |
TTS_accuracy | Text-to-speech accuracy | text, generated_audio |
Text Quality (12 templates)
| Template | Description | Inputs |
|---|---|---|
is_helpful | Answers the question effectively | input, output |
is_concise | Brief and to the point | output |
is_polite | Respectful and non-aggressive | output |
is_informal_tone | Casual tone detection | output |
task_completion | Task fulfilled accurately | input, output |
prompt_adherence | Follows prompt instructions | input, output |
prompt_instruction_adherence | Follows format and constraints | output, prompt |
no_apologies | No unnecessary apologies | output |
no_llm_reference | No “I’m an AI” references | output |
contains_code | Valid code in output | output |
text_to_sql | Correct SQL from natural language | input, output |
cultural_sensitivity | Culturally appropriate language | output |
Audio (2 templates)
| Template | Description | Inputs |
|---|---|---|
ASR/STT_accuracy | Transcription accuracy | audio, generated_transcript |
audio_quality | Audio quality (MOS-style) | input_audio |
Image (2 templates)
| Template | Description | Inputs |
|---|---|---|
image_instruction_adherence | Generated image matches text instruction | instruction, images |
synthetic_image_evaluator | Detects AI-generated images | image |