Cloud Evals
Run pre-built evaluation templates on Future AGI's Turing cloud models. 100+ templates covering safety, RAG, hallucination, conversation quality, and more.
- 100+ pre-built templates on Turing cloud models (turing_flash, turing_small, turing_large)
- Use list_evaluations() to discover available templates and filter by tag
- Templates are updated server-side; new ones appear without upgrading your SDK
When you need to check something subjective (is this response helpful? is the tone right? did the model hallucinate?), local heuristics aren’t enough. Cloud evals send your data to Future AGI’s Turing models for scoring. Templates are managed server-side, so new ones appear without a pip upgrade. For the full platform guide on evaluations, see Evaluation docs.
Note
Requires pip install ai-evaluation and FI_API_KEY + FI_SECRET_KEY set in your environment.
Quick Example
```python
from fi.evals import evaluate

result = evaluate("toxicity", output="You're doing a great job!", model="turing_flash")
print(result.score)   # 1.0
print(result.passed)  # True
```
Discovering Templates
Use list_evaluations() to see what’s available and what inputs each template needs.
```python
from fi.evals import Evaluator

evaluator = Evaluator()
templates = evaluator.list_evaluations()
print(f"Total templates: {len(templates)}")
# Total templates: 107

# Each template has:
t = templates[0]
print(t["name"])                    # "toxicity"
print(t["description"])             # what it checks
print(t["evalTags"])                # categories like ["SAFETY", "TEXT"]
print(t["config"]["requiredKeys"])  # what inputs you need to pass
```
Filtering by tag
```python
from fi.evals import Evaluator

evaluator = Evaluator()
templates = evaluator.list_evaluations()

# Get all safety templates
safety = [t for t in templates if "SAFETY" in t.get("evalTags", [])]
print(f"Safety templates: {len(safety)}")
for t in safety:
    print(f"  {t['name']}: {t['description'][:80]}")

# Get all RAG templates
rag = [t for t in templates if "RAG" in t.get("evalTags", [])]
print(f"RAG templates: {len(rag)}")
```
Available tags
| Tag | What it covers |
|---|---|
| SAFETY | Toxicity, bias, PII, content moderation, prompt injection |
| RAG | Context adherence, chunk attribution, faithfulness, retrieval metrics |
| HALLUCINATION | Hallucination detection, factual accuracy, groundedness |
| CONVERSATION | Coherence, resolution, customer agent behaviors |
| CHAT | General chat quality metrics |
| AUDIO | Transcription accuracy, audio quality, TTS/ASR |
| IMAGE | Caption hallucination, image instruction adherence |
| TEXT | General text quality (completeness, tone, helpfulness) |
| FUNCTION | Deterministic checks (contains, regex, JSON, similarity) |
| LLMS | LLM-specific checks (bias, completeness, attribution) |
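To see how templates distribute across these tags, you can bucket the output of list_evaluations() locally. A minimal sketch; the sample records below are illustrative stand-ins for real template entries, not actual API output:

```python
from collections import defaultdict

def group_by_tag(templates):
    """Bucket template names under each tag in their evalTags list."""
    groups = defaultdict(list)
    for t in templates:
        for tag in t.get("evalTags", []):
            groups[tag].append(t["name"])
    return dict(groups)

# Illustrative records mirroring the shape returned by list_evaluations()
sample = [
    {"name": "toxicity", "evalTags": ["SAFETY", "TEXT"]},
    {"name": "context_adherence", "evalTags": ["RAG"]},
    {"name": "pii", "evalTags": ["SAFETY"]},
]
print(group_by_tag(sample))
# {'SAFETY': ['toxicity', 'pii'], 'TEXT': ['toxicity'], 'RAG': ['context_adherence']}
```

A template can carry several tags, so the same name may appear under more than one bucket.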
Checking required inputs
Before calling a template, check what inputs it needs:
```python
from fi.evals import Evaluator

evaluator = Evaluator()
templates = evaluator.list_evaluations()

# Find a specific template
toxicity = next(t for t in templates if t["name"] == "toxicity")
print(toxicity["config"]["requiredKeys"])              # what you need to pass
print(toxicity["config"].get("configParamsDesc", {}))  # parameter descriptions
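Since each template declares its requiredKeys, you can guard calls with a small pre-flight check. A sketch; validate_inputs is a hypothetical helper, not part of the SDK, and the record below only mirrors the shape shown above:

```python
def validate_inputs(template, **kwargs):
    """Raise if any key listed in the template's requiredKeys is missing."""
    required = template["config"]["requiredKeys"]
    missing = [k for k in required if k not in kwargs]
    if missing:
        raise ValueError(f"{template['name']} is missing inputs: {missing}")

# Illustrative record mirroring a list_evaluations() entry
toxicity = {"name": "toxicity", "config": {"requiredKeys": ["output"]}}
validate_inputs(toxicity, output="Hello")  # passes silently
```

Running this before evaluate() turns a server-side error into an immediate, descriptive local one.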
Turing Models
Three tiers:
| Model | Speed | Use for |
|---|---|---|
| turing_flash | ~1-2s | Quick checks, high-volume scoring |
| turing_small | ~2-3s | Balanced speed and accuracy |
| turing_large | ~3-5s | Complex judgments, highest accuracy |
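If you route evals by latency budget, the tier choice can be encoded once. A sketch using the rough latency figures from the table; pick_model is a hypothetical helper, not an SDK function:

```python
def pick_model(max_seconds):
    """Choose the most accurate Turing tier within a rough latency budget."""
    if max_seconds >= 5:
        return "turing_large"  # ~3-5s, highest accuracy
    if max_seconds >= 3:
        return "turing_small"  # ~2-3s, balanced
    return "turing_flash"      # ~1-2s, quick checks

print(pick_model(10))  # turing_large
print(pick_model(1))   # turing_flash
```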
```python
from fi.evals import evaluate

# Fast check
result = evaluate("toxicity", output="...", model="turing_flash")

# More accurate
result = evaluate("toxicity", output="...", model="turing_large")
```
Running Cloud Evals
With the evaluate() function
```python
from fi.evals import evaluate

# Single template
result = evaluate("tone", output="Dear Sir, I hope this finds you well.", model="turing_flash")
print(result.score)   # 1.0
print(result.passed)  # True
print(result.reason)  # detailed explanation from Turing

# Multiple inputs
result = evaluate(
    "context_adherence",
    output="Paris is the capital of France.",
    context="France is a country in Western Europe. Its capital is Paris.",
    model="turing_flash",
)
```
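In a test suite or CI gate you usually care only that an eval passed. The results above expose .passed and .reason, so a thin assertion wrapper is enough. A sketch; assert_eval_passed is a hypothetical helper, and the stub class exists only to demonstrate it without a network call:

```python
def assert_eval_passed(result, label=""):
    """Fail loudly with the Turing explanation when an eval does not pass."""
    if not result.passed:
        raise AssertionError(f"eval {label!r} failed: {result.reason}")

# Works with any object exposing .passed / .reason, e.g. a stub for testing:
class Stub:
    passed = True
    reason = "ok"

assert_eval_passed(Stub(), label="tone")  # no exception
```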
With the Evaluator class
The Evaluator class provides additional features for cloud evals: pipeline execution, async results, and batch processing.
```python
from fi.evals import Evaluator

evaluator = Evaluator()

# Run a pipeline across a dataset
result = evaluator.evaluate_pipeline(
    project_name="my-project",
    version="v1",
    eval_data=[
        {"template": "toxicity", "output": "Hello world", "model_name": "turing_flash"},
        {"template": "tone", "output": "Dear Sir...", "model_name": "turing_flash"},
    ],
)

# Get async results
result = evaluator.get_eval_result(eval_id="abc-123")

# Get pipeline results across versions
results = evaluator.get_pipeline_results(
    project_name="my-project",
    versions=["v1", "v2"],
)
```
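Because eval_data is a list of plain dicts, pipeline rows can be generated from your own dataset instead of written by hand. A sketch; the field names mirror the example above, while build_eval_rows and the dataset shape are assumptions for illustration:

```python
def build_eval_rows(records, templates, model="turing_flash"):
    """Cross every dataset record with every template name into eval_data rows."""
    return [
        {"template": tpl, "output": rec["output"], "model_name": model}
        for rec in records
        for tpl in templates
    ]

rows = build_eval_rows(
    [{"output": "Hello world"}, {"output": "Dear Sir..."}],
    ["toxicity", "tone"],
)
print(len(rows))  # 4
```

The resulting list can be passed directly as the eval_data argument of evaluate_pipeline().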
Template Reference
Grouped by category. Run list_evaluations() for the latest — new templates are added without SDK updates.
Safety (18 templates)
| Template | Description | Inputs |
|---|---|---|
| toxicity | Toxic or harmful language | output |
| content_moderation | Content safety using moderation models | output |
| content_safety_violation | Broad safety/usage policy violations | output |
| pii | Personally identifiable information | input |
| prompt_injection | Prompt injection attempts | input |
| protect_flash | FutureAGI proprietary harm detection | input |
| bias_detection | Gender, racial, cultural, ideological bias | output |
| no_racial_bias | Absence of racial bias | output |
| no_gender_bias | Absence of gender bias | output |
| no_age_bias | Absence of age bias | output |
| sexist | Sexist content and gender bias | output |
| tone | Tone and sentiment analysis | output |
| data_privacy_compliance | GDPR/HIPAA compliance | output |
| is_compliant | Legal/regulatory compliance | output |
| is_harmful_advice | Physically/legally harmful advice | output |
| no_harmful_therapeutic_guidance | Harmful psychological/therapeutic advice | output |
| clinically_inappropriate_tone | Medical tone appropriateness | output |
| answer_refusal | Correct refusal on harmful queries | input, output |
RAG & Context (14 templates)
| Template | Description | Inputs |
|---|---|---|
| context_adherence | Response stays within provided context | output, context |
| context_relevance | Retrieved context relevance to query | context, input |
| groundedness | Output grounded in context | output, input, context |
| detect_hallucination | Fabricated facts not in context | input, output, context |
| is_factually_consistent | Factual consistency with source | input, output, context |
| factual_accuracy | Factual accuracy against context | input, output, context |
| chunk_attribution | Correct chunk citation | context, output |
| chunk_utilization | Effective use of context chunks | context, output |
| completeness | Response completeness given context | input, output |
| summary_quality | Summary captures main points | input, output |
| is_good_summary | Clear, well-structured summary | input, output |
| eval_ranking | Ranks context by criteria | input, context |
| translation_accuracy | Translation quality | input, output |
| caption_hallucination | Image caption inaccuracies | image, caption |
Conversation (14 templates)
| Template | Description | Inputs |
|---|---|---|
| conversation_coherence | Logical flow and context maintenance | conversation |
| conversation_resolution | Satisfactory conclusion reached | conversation |
| customer_agent_query_handling | Correct query interpretation | conversation |
| customer_agent_context_retention | Remembers earlier context | conversation |
| customer_agent_conversation_quality | Overall conversation quality | conversation |
| customer_agent_clarification_seeking | Seeks clarification when needed | conversation |
| customer_agent_objection_handling | Handles objections effectively | conversation |
| customer_agent_human_escalation | Escalates to human appropriately | conversation |
| customer_agent_loop_detection | Detects repetitive loops | conversation |
| customer_agent_interruption_handling | Waits for user to finish | conversation |
| customer_agent_language_handling | Correct language/dialect handling | conversation |
| customer_agent_termination_handling | No crashes or abrupt cut-offs | conversation |
| customer_agent_prompt_conformance | Adheres to system prompt | system_prompt, conversation |
| TTS_accuracy | Text-to-speech accuracy | text, generated_audio |
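The exact shape of the conversation input is defined per template, so check its requiredKeys and configParamsDesc before calling. A common convention, assumed here for illustration, is a list of role/content messages:

```python
# Assumed message-list shape; verify against the template's requiredKeys
conversation = [
    {"role": "user", "content": "My order never arrived."},
    {"role": "assistant", "content": "Sorry to hear that! Can you share the order number?"},
    {"role": "user", "content": "It's 48213."},
]

# With that shape, a call would look like this (untested sketch):
# result = evaluate("conversation_coherence", conversation=conversation, model="turing_flash")
```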
Text Quality (12 templates)
| Template | Description | Inputs |
|---|---|---|
| is_helpful | Answers the question effectively | input, output |
| is_concise | Brief and to the point | output |
| is_polite | Respectful and non-aggressive | output |
| is_informal_tone | Casual tone detection | output |
| task_completion | Task fulfilled accurately | input, output |
| prompt_adherence | Follows prompt instructions | input, output |
| prompt_instruction_adherence | Follows format and constraints | output, prompt |
| no_apologies | No unnecessary apologies | output |
| no_llm_reference | No “I’m an AI” references | output |
| contains_code | Valid code in output | output |
| text_to_sql | Correct SQL from natural language | input, output |
| cultural_sensitivity | Culturally appropriate language | output |
Audio (2 templates)
| Template | Description | Inputs |
|---|---|---|
| ASR/STT_accuracy | Transcription accuracy | audio, generated_transcript |
| audio_quality | Audio quality (MOS-style) | input_audio |
Image (2 templates)
| Template | Description | Inputs |
|---|---|---|
| image_instruction_adherence | Generated image matches text instruction | instruction, images |
| synthetic_image_evaluator | Detects AI-generated images | image |