Evaluation Using Interface

Input:

  • Required Inputs:
    • input: The text or instruction column that serves as the reference for evaluation.
    • rule_prompt: A guideline or rule column against which the input is measured. This can include dynamic placeholders.
  • Optional Inputs:
    • Note: Although the eval definition also covers input/output images, the interface parameters listed here are limited to the text/instruction and rule prompt. Image inputs can be added where the interface makes them configurable.

Output:

  • Score: Percentage score between 0 and 100

Interpretation:

  • Higher scores: Indicate strong alignment and coherence between the input/instruction and the rule prompt.
  • Lower scores: Suggest discrepancies or misalignment.

Evaluation Using Python SDK

Click here to learn how to set up evaluation using the Python SDK.


Input:

  • Required Inputs:
    • input (string): The text or instruction that serves as the reference for evaluation.
    • rule_prompt (string): A guideline or rule against which the input is measured.
  • Optional Inputs:
    • Image parameters, where they are applicable via the SDK.

Output:

  • Score (float): A score between 0 and 1, where higher values indicate better alignment/coherence.

from fi.evals import EvalClient
from fi.evals.templates import ScoreEval
from fi.testcases import MLLMTestCase

# Define the test case: the input instruction plus the scoring rules to evaluate it against
test_case = MLLMTestCase(
    input="Score the sunset beach photo's composition and atmosphere",
    rule_string="""
    Evaluate:
    1. Golden hour lighting quality
    2. Rule of thirds composition
    3. Foreground to background balance
    4. Color harmony and mood
    """
)

# Configure the ScoreEval template with the evaluation criteria and scoring rubric
template = ScoreEval(
    config={
        "input": {
            "type": "rule_string",
            "default": ["composition", "lighting", "atmosphere"]
        },
        "rulePrompt": {
            "type": "rule_prompt",
            "default": """
            Provide a score based on:
            - Composition (0-0.25): Rule of thirds, leading lines
            - Lighting (0-0.25): Quality of light, shadows, highlights
            - Atmosphere (0-0.25): Mood, emotional impact
            - Technical (0-0.25): Focus, exposure, clarity
            """
        }
    }
)

# Initialize the evaluation client with your Future AGI credentials
evaluator = EvalClient(
    fi_api_key="your_api_key",
    fi_secret_key="your_secret_key",
    fi_base_url="https://api.futureagi.com"
)

# Run the evaluation of the test case against the template
result = evaluator.evaluate(eval_templates=[template], inputs=[test_case])

# Extract the numeric score and the evaluator's reasoning from the first result
score = result.eval_results[0].metrics[0].value
reason = result.eval_results[0].reason

print(f"Evaluation Score: {score}")
print(f"Evaluation Reason: {reason}")


What to Do if Score Eval Gives a Low Score

Reassess the evaluation criteria to make sure they are clearly defined and aligned with the intended evaluation goals. Adjust them where needed so they are more comprehensive and relevant.

Additionally, examine the output images for alignment with the instructions and input images to identify discrepancies or misalignments.

Refining the instructions or improving the image generation process can also enhance the overall evaluation outcome, as sketched below.
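
For example, a vague rubric can be tightened into explicit, weighted criteria and re-evaluated. The following sketch reuses the MLLMTestCase, template, and evaluator objects from the SDK example above; the criteria and weights are illustrative assumptions, not values required by the SDK.

# Illustrative sketch: restate the rubric as explicit, weighted criteria and re-run the eval.
refined_test_case = MLLMTestCase(
    input="Score the sunset beach photo's composition and atmosphere",
    rule_string="""
    Score each criterion independently and sum the results:
    1. Rule of thirds composition (0-0.25)
    2. Golden hour lighting quality (0-0.25)
    3. Foreground/background balance (0-0.25)
    4. Color harmony and mood (0-0.25)
    """
)

# Compare the new score against the earlier run to check whether the refined criteria help.
refined_result = evaluator.evaluate(eval_templates=[template], inputs=[refined_test_case])
print(f"Refined score: {refined_result.eval_results[0].metrics[0].value}")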


Differentiating Score Eval from Eval Image Instruction

Eval Image Instruction focuses specifically on assessing the alignment between textual instructions and the generated image, ensuring that the image accurately represents the given instructions. In contrast, Score Eval has a broader scope, evaluating coherence and alignment across multiple inputs and outputs, including both text and images.

Eval Image Instruction assesses instruction-image accuracy, whereas Score Eval examines overall coherence and adherence to instructions. Eval Image Instruction is ideal for cases where precise image representation is the main concern, while Score Eval is better suited for complex scenarios involving multiple modalities, ensuring comprehensive alignment and coherence.