Evaluation Using Interface

Input:

  • Required Inputs:
    • expected_response: The reference answer column.
    • response: The generated answer column.
  • Configuration Parameters:
    • Comparator: The method used for comparison (e.g., Cosine, Exact Match).
    • Failure Threshold: Float (e.g., 0.7) - The similarity score below which the evaluation is considered a failure.

Output:

  • Score: Percentage score between 0 and 100

Interpretation:

  • Scores ≥ (Failure Threshold * 100): Indicate that the generated response is sufficiently similar to the expected_response based on the chosen Comparator.
  • Scores < (Failure Threshold * 100): Suggest that the response deviates significantly from the expected_response. For example, with a Failure Threshold of 0.7, a score of 85 passes while a score of 62 fails.

Evaluation Using Python SDK

Click here to learn how to set up evaluation using the Python SDK.


Input:

  • Required Inputs:
    • expected_response (string): The reference answer.
    • response (string): The generated answer.
  • Configuration Parameters:
    • comparator (string): The method to use for comparison (e.g., Comparator.COSINE.value).
    • failure_threshold (float): The threshold below which the evaluation fails (e.g., 0.7).
Supported comparators:

  • Cosine Similarity: Comparator.COSINE.value
  • Jaccard Similarity: Comparator.JACCARD.value
  • Normalised Levenshtein Similarity: Comparator.NORMALISED_LEVENSHTEIN.value
  • Jaro-Winkler Similarity: Comparator.JARO_WINKLER.value
  • Sorensen-Dice Similarity: Comparator.SORENSEN_DICE.value
Output:

  • Score (float): A value between 0 and 1. Values ≥ failure_threshold indicate sufficient similarity.

from fi.evals import EvalClient
from fi.testcases import LLMTestCase
from fi.evals.templates import AnswerSimilarity
from fi.evals.types import Comparator

# Configure the evaluation: cosine similarity with a 0.8 failure threshold
similarity_eval = AnswerSimilarity(config={
    "comparator": Comparator.COSINE.value,
    "failure_threshold": 0.8
})

# Pair the generated response with the reference answer
test_case = LLMTestCase(
    response="example response",
    expected_response="example of expected response"
)

# Run the evaluation and read the similarity score (0 to 1)
evaluator = EvalClient()
result = evaluator.evaluate(eval_templates=[similarity_eval], inputs=[test_case])
similarity_score = result.eval_results[0].metrics[0].value
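
Because the SDK returns a score between 0 and 1 rather than a percentage, a pass/fail decision can be derived by comparing the score against the configured failure_threshold. The check below is a minimal sketch that continues the example above:

# A score at or above the configured failure_threshold counts as a pass
passed = similarity_score >= 0.8
print(f"similarity={similarity_score:.3f}, passed={passed}")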


What to Do When Answer Similarity Evaluation is Low

Start with a response review: reassess how closely the actual response aligns with the expected response and identify where the two diverge. If necessary, make a comparator adjustment, selecting an alternative similarity measure that better captures nuanced differences in meaning, as in the sketch below.
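
As a sketch of such a comparator adjustment, the same AnswerSimilarity template can be re-instantiated with any comparator from the table above; the threshold shown here is illustrative:

from fi.evals.templates import AnswerSimilarity
from fi.evals.types import Comparator

# Illustrative adjustment: swap cosine similarity for normalised Levenshtein,
# which is more sensitive to wording and character-level differences
adjusted_eval = AnswerSimilarity(config={
    "comparator": Comparator.NORMALISED_LEVENSHTEIN.value,
    "failure_threshold": 0.7  # example threshold; tune for your use case
})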


Differentiating Answer Similarity from Context Sufficiency

Answer Similarity specifically measures how closely two responses align in meaning, whereas Context Sufficiency determines whether a given context provides enough information to answer a query.

From an input perspective, Answer Similarity requires both an expected and actual response for comparison, while Context Sufficiency evaluates a query against its provided context.
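
To make the input difference concrete, the sketch below builds one test case for each evaluation. The Answer Similarity case mirrors the example above; the Context Sufficiency case assumes that LLMTestCase also accepts query and context fields, which is an assumption for illustration rather than something shown earlier on this page:

from fi.testcases import LLMTestCase

# Answer Similarity: compares a generated response with a reference response
answer_similarity_case = LLMTestCase(
    response="Paris is the capital of France.",
    expected_response="The capital of France is Paris."
)

# Context Sufficiency: pairs a query with the context it should be answerable
# from (query/context fields are assumed here for illustration)
context_sufficiency_case = LLMTestCase(
    query="What is the capital of France?",
    context="France is a country in Western Europe. Its capital is Paris."
)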