Cloud Evals

Run pre-built evaluation templates on Future AGI's Turing cloud models. 100+ templates covering safety, RAG, hallucination, conversation quality, and more.

📝 TL;DR
  • 100+ pre-built templates on Turing cloud models (turing_flash, turing_small, turing_large)
  • Use list_evaluations() to discover available templates and filter by tag
  • Templates are updated server-side — new ones appear without upgrading your SDK

When you need to check something subjective (is this response helpful? is the tone right? did the model hallucinate?), local heuristics aren’t enough. Cloud evals send your data to Future AGI’s Turing models for scoring. Templates are managed server-side, so new ones appear without a pip upgrade. For the full platform guide on evaluations, see Evaluation docs.

Note

Requires pip install ai-evaluation and FI_API_KEY + FI_SECRET_KEY set in your environment.
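A typical setup, for example (key values shown as placeholders; obtain both keys from your Future AGI account):

```shell
pip install ai-evaluation

# Both keys must be present in the environment before running evals
export FI_API_KEY="..."
export FI_SECRET_KEY="..."
```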

Quick Example

from fi.evals import evaluate

result = evaluate("toxicity", output="You're doing a great job!", model="turing_flash")
print(result.score)   # 1.0
print(result.passed)  # True

Discovering Templates

Use list_evaluations() to see what’s available and what inputs each template needs.

from fi.evals import Evaluator

evaluator = Evaluator()
templates = evaluator.list_evaluations()

print(f"Total templates: {len(templates)}")
# Total templates: 107

# Each template has:
t = templates[0]
print(t["name"])           # "toxicity"
print(t["description"])    # what it checks
print(t["evalTags"])       # categories like ["SAFETY", "TEXT"]
print(t["config"]["requiredKeys"])  # what inputs you need to pass

Filtering by tag

from fi.evals import Evaluator

evaluator = Evaluator()
templates = evaluator.list_evaluations()

# Get all safety templates
safety = [t for t in templates if "SAFETY" in t.get("evalTags", [])]
print(f"Safety templates: {len(safety)}")
for t in safety:
    print(f"  {t['name']}: {t['description'][:80]}")

# Get all RAG templates
rag = [t for t in templates if "RAG" in t.get("evalTags", [])]
print(f"RAG templates: {len(rag)}")

Available tags

| Tag | What it covers |
|---|---|
| SAFETY | Toxicity, bias, PII, content moderation, prompt injection |
| RAG | Context adherence, chunk attribution, faithfulness, retrieval metrics |
| HALLUCINATION | Hallucination detection, factual accuracy, groundedness |
| CONVERSATION | Coherence, resolution, customer agent behaviors |
| CHAT | General chat quality metrics |
| AUDIO | Transcription accuracy, audio quality, TTS/ASR |
| IMAGE | Caption hallucination, image instruction adherence |
| TEXT | General text quality (completeness, tone, helpfulness) |
| FUNCTION | Deterministic checks (contains, regex, JSON, similarity) |
| LLMS | LLM-specific checks (bias, completeness, attribution) |
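The tag-filter pattern above generalizes to grouping the whole catalog at once. A small helper, sketched here under the assumption that each template is a dict with an `evalTags` list (as shown in the discovery example):

```python
from collections import defaultdict

def group_by_tag(templates):
    """Group template names by each tag in their evalTags list."""
    groups = defaultdict(list)
    for t in templates:
        for tag in t.get("evalTags", []):
            groups[tag].append(t["name"])
    return dict(groups)

# Works on any list of template dicts, e.g. from evaluator.list_evaluations()
sample = [
    {"name": "toxicity", "evalTags": ["SAFETY", "TEXT"]},
    {"name": "context_adherence", "evalTags": ["RAG"]},
]
print(group_by_tag(sample))
# {'SAFETY': ['toxicity'], 'TEXT': ['toxicity'], 'RAG': ['context_adherence']}
```

A template can carry several tags, so it may appear in more than one group.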

Checking required inputs

Before calling a template, check what inputs it needs:

from fi.evals import Evaluator

evaluator = Evaluator()
templates = evaluator.list_evaluations()

# Find a specific template
toxicity = next(t for t in templates if t["name"] == "toxicity")
print(toxicity["config"]["requiredKeys"])  # what you need to pass
print(toxicity["config"].get("configParamsDesc", {}))  # parameter descriptions

Turing Models

Three tiers:

| Model | Speed | Use for |
|---|---|---|
| turing_flash | ~1-2s | Quick checks, high-volume scoring |
| turing_small | ~2-3s | Balanced speed and accuracy |
| turing_large | ~3-5s | Complex judgments, highest accuracy |

from fi.evals import evaluate

# Fast check
result = evaluate("toxicity", output="...", model="turing_flash")

# More accurate
result = evaluate("toxicity", output="...", model="turing_large")
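One way to balance the tiers is to score everything with turing_flash and rerun only borderline items on turing_large. A minimal sketch, not an official pattern: `run_eval` stands in for a call to `evaluate(...).score`, and the 0.3-0.7 band is an arbitrary choice:

```python
def score_with_escalation(run_eval, low=0.3, high=0.7):
    """Score with the fast model; rerun borderline scores on the large one.

    `run_eval(model_name)` is any callable returning a numeric score,
    e.g. a closure around evaluate(...).score.
    """
    score = run_eval("turing_flash")
    if low <= score <= high:  # borderline: pay for the bigger model
        score = run_eval("turing_large")
    return score

# Demo with a fake scorer instead of a real API call
fake = {"turing_flash": 0.5, "turing_large": 0.9}
print(score_with_escalation(fake.__getitem__))  # 0.9
```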

Running Cloud Evals

With the evaluate() function

from fi.evals import evaluate

# Single template
result = evaluate("tone", output="Dear Sir, I hope this finds you well.", model="turing_flash")
print(result.score)    # 1.0
print(result.passed)   # True
print(result.reason)   # detailed explanation from Turing

# Multiple inputs
result = evaluate(
    "context_adherence",
    output="Paris is the capital of France.",
    context="France is a country in Western Europe. Its capital is Paris.",
    model="turing_flash",
)

With the Evaluator class

The Evaluator class provides additional features for cloud evals: pipeline execution, async results, and batch processing.

from fi.evals import Evaluator

evaluator = Evaluator()

# Run a pipeline across a dataset
result = evaluator.evaluate_pipeline(
    project_name="my-project",
    version="v1",
    eval_data=[
        {"template": "toxicity", "output": "Hello world", "model_name": "turing_flash"},
        {"template": "tone", "output": "Dear Sir...", "model_name": "turing_flash"},
    ],
)

# Get async results
result = evaluator.get_eval_result(eval_id="abc-123")

# Get pipeline results across versions
results = evaluator.get_pipeline_results(
    project_name="my-project",
    versions=["v1", "v2"],
)
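For large datasets you might split `eval_data` into chunks and submit each chunk as its own `evaluate_pipeline()` call. A batching sketch; the batch size of 50 is an assumption, not an SDK requirement:

```python
def batched(items, size=50):
    """Yield successive chunks of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# e.g. each batch would be passed as eval_data to evaluate_pipeline()
eval_data = [{"template": "toxicity", "output": f"row {i}"} for i in range(120)]
print([len(b) for b in batched(eval_data)])  # [50, 50, 20]
```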

Template Reference

Grouped by category. Run list_evaluations() for the latest — new templates are added without SDK updates.

Safety (18 templates)

| Template | Description | Inputs |
|---|---|---|
| toxicity | Toxic or harmful language | output |
| content_moderation | Content safety using moderation models | output |
| content_safety_violation | Broad safety/usage policy violations | output |
| pii | Personally identifiable information | input |
| prompt_injection | Prompt injection attempts | input |
| protect_flash | FutureAGI proprietary harm detection | input |
| bias_detection | Gender, racial, cultural, ideological bias | output |
| no_racial_bias | Absence of racial bias | output |
| no_gender_bias | Absence of gender bias | output |
| no_age_bias | Absence of age bias | output |
| sexist | Sexist content and gender bias | output |
| tone | Tone and sentiment analysis | output |
| data_privacy_compliance | GDPR/HIPAA compliance | output |
| is_compliant | Legal/regulatory compliance | output |
| is_harmful_advice | Physically/legally harmful advice | output |
| no_harmful_therapeutic_guidance | Harmful psychological/therapeutic advice | output |
| clinically_inappropriate_tone | Medical tone appropriateness | output |
| answer_refusal | Correct refusal on harmful queries | input, output |
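Safety templates compose naturally into a gate: run several and block the output unless every one passes. A minimal aggregation sketch, assuming result objects expose a `passed` attribute as shown in the `evaluate()` examples above:

```python
def all_safe(results):
    """True only if every safety check passed.

    `results` maps template name -> result object with a `passed`
    attribute (e.g. the return value of evaluate()).
    """
    return all(r.passed for r in results.values())

# Stand-ins for real evaluate() results
class FakeResult:
    def __init__(self, passed):
        self.passed = passed

results = {"toxicity": FakeResult(True), "pii": FakeResult(True)}
print(all_safe(results))  # True
```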

RAG & Context (14 templates)

| Template | Description | Inputs |
|---|---|---|
| context_adherence | Response stays within provided context | output, context |
| context_relevance | Retrieved context relevance to query | context, input |
| groundedness | Output grounded in context | output, input, context |
| detect_hallucination | Fabricated facts not in context | input, output, context |
| is_factually_consistent | Factual consistency with source | input, output, context |
| factual_accuracy | Factual accuracy against context | input, output, context |
| chunk_attribution | Correct chunk citation | context, output |
| chunk_utilization | Effective use of context chunks | context, output |
| completeness | Response completeness given context | input, output |
| summary_quality | Summary captures main points | input, output |
| is_good_summary | Clear, well-structured summary | input, output |
| eval_ranking | Ranks context by criteria | input, context |
| translation_accuracy | Translation quality | input, output |
| caption_hallucination | Image caption inaccuracies | image, caption |

Conversation (14 templates)

| Template | Description | Inputs |
|---|---|---|
| conversation_coherence | Logical flow and context maintenance | conversation |
| conversation_resolution | Satisfactory conclusion reached | conversation |
| customer_agent_query_handling | Correct query interpretation | conversation |
| customer_agent_context_retention | Remembers earlier context | conversation |
| customer_agent_conversation_quality | Overall conversation quality | conversation |
| customer_agent_clarification_seeking | Seeks clarification when needed | conversation |
| customer_agent_objection_handling | Handles objections effectively | conversation |
| customer_agent_human_escalation | Escalates to human appropriately | conversation |
| customer_agent_loop_detection | Detects repetitive loops | conversation |
| customer_agent_interruption_handling | Waits for user to finish | conversation |
| customer_agent_language_handling | Correct language/dialect handling | conversation |
| customer_agent_termination_handling | No crashes or abrupt cut-offs | conversation |
| customer_agent_prompt_conformance | Adheres to system prompt | system_prompt, conversation |
| TTS_accuracy | Text-to-speech accuracy | text, generated_audio |

Text Quality (12 templates)

| Template | Description | Inputs |
|---|---|---|
| is_helpful | Answers the question effectively | input, output |
| is_concise | Brief and to the point | output |
| is_polite | Respectful and non-aggressive | output |
| is_informal_tone | Casual tone detection | output |
| task_completion | Task fulfilled accurately | input, output |
| prompt_adherence | Follows prompt instructions | input, output |
| prompt_instruction_adherence | Follows format and constraints | output, prompt |
| no_apologies | No unnecessary apologies | output |
| no_llm_reference | No "I'm an AI" references | output |
| contains_code | Valid code in output | output |
| text_to_sql | Correct SQL from natural language | input, output |
| cultural_sensitivity | Culturally appropriate language | output |

Audio (2 templates)

| Template | Description | Inputs |
|---|---|---|
| ASR/STT_accuracy | Transcription accuracy | audio, generated_transcript |
| audio_quality | Audio quality (MOS-style) | input_audio |

Image (2 templates)

| Template | Description | Inputs |
|---|---|---|
| image_instruction_adherence | Generated image matches text instruction | instruction, images |
| synthetic_image_evaluator | Detects AI-generated images | image |