Evaluation Using Interface

Input:

  • Required Inputs:
    • context: The context column provided to the model.
    • response: The response column generated by the model.
  • Configuration Parameters:
    • Comparator: The method to use for comparison (Cosine Similarity, Jaccard Similarity, Normalised Levenshtein Similarity, Jaro-Winkler Similarity, Sorensen-Dice Similarity); see the sketch at the end of this section for how these metrics behave.
    • Failure Threshold: The threshold below which the evaluation fails (e.g., 0.7)

Output:

  • Score: A percentage score between 0 and 100

Interpretation:

  • Higher scores: Indicate that the provided context closely matches the context used to generate the response.
  • Lower scores: Indicate that the provided context diverges from the context used to generate the response.
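
The comparator selected above determines how the two texts are scored. As an illustrative sketch only, and not the platform's internal implementation, the standard formulas behind these metrics can be computed on plain Python strings as shown below; every function name here is hypothetical.

from collections import Counter
from math import sqrt

# Illustrative formulas only -- the platform computes these metrics internally.

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity over token-count vectors."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def jaccard_similarity(a: str, b: str) -> float:
    """Intersection over union of the two token sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def sorensen_dice_similarity(a: str, b: str) -> float:
    """2 * |intersection| / (|A| + |B|) over the two token sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return 2 * len(sa & sb) / (len(sa) + len(sb)) if sa or sb else 0.0

def normalised_levenshtein_similarity(a: str, b: str) -> float:
    """1 - edit_distance / max_length, computed character by character."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ch_a != ch_b)))
        prev = curr
    return 1 - prev[-1] / max(len(a), len(b))

context = "The Earth orbits around the Sun in an elliptical path."
response = "The Earth's orbit around the Sun is not perfectly circular but elliptical."
print(round(cosine_similarity(context, response), 2))
print(round(jaccard_similarity(context, response), 2))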

Evaluation Using Python SDK

Click here to learn how to set up evaluation using the Python SDK.


| Input | Parameter | Type | Description |
|---|---|---|---|
| Required Inputs | context | string | The context provided to the model. |
| Required Inputs | response | string | The response generated by the model. |
| Configuration Parameters | Comparator | string | The method to use for comparison (Cosine Similarity, etc.). Class names are listed in the table below. |
| Configuration Parameters | Failure Threshold | float | The threshold below which the evaluation fails (e.g., 0.7). |

| Comparator Name | Class Name |
|---|---|
| Cosine Similarity | Comparator.COSINE.value |
| Jaccard Similarity | Comparator.JACCARD.value |
| Normalised Levenshtein Similarity | Comparator.NORMALISED_LEVENSHTEIN.value |
| Jaro-Winkler Similarity | Comparator.JARO_WINKLER.value |
| Sorensen-Dice Similarity | Comparator.SORENSEN_DICE.value |
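
Any of these class names can be passed as the comparator value in the template config. For example, to score with Jaro-Winkler instead of cosine similarity (same pattern as the full example below):

template = ContextSimilarity(
    config={
        "comparator": Comparator.JARO_WINKLER.value,  # any value from the table above
        "failure_threshold": 0.7
    }
)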

| Output | Type | Description |
|---|---|---|
| Score | float | Returns a score between 0 and 1. Higher scores indicate more similarity between context and response; lower scores indicate less similarity. |

from fi.testcases import TestCase
from fi.evals.types import Comparator
from fi.evals.templates import ContextSimilarity

# Configure the Context Similarity eval with a comparator and a failure threshold
template = ContextSimilarity(
    config={
        "comparator": Comparator.COSINE.value,
        "failure_threshold": 0.7
    }
)

# Build a test case with the context given to the model and the model's response
test_case = TestCase(
    context="The Earth orbits around the Sun in an elliptical path.",
    response="The Earth's orbit around the Sun is not perfectly circular but elliptical."
)

# `evaluator` is the evaluation client initialised as described in the SDK setup guide linked above
result = evaluator.evaluate(eval_templates=[template], inputs=[test_case])

# Extract the similarity score (a float between 0 and 1)
score = result.eval_results[0].metrics[0].value
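
Because the returned score is a plain float between 0 and 1, it can also be checked directly against the configured threshold in your own code; the pass/fail print below is an illustrative sketch, not an SDK feature.

FAILURE_THRESHOLD = 0.7  # same value passed in the template config

if score >= FAILURE_THRESHOLD:
    print(f"Context similarity passed with score {score:.2f}")
else:
    print(f"Context similarity failed: {score:.2f} is below {FAILURE_THRESHOLD}")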


What to do when Context Similarity is Low

First, identify discrepancies: determine which elements of the provided context do not align with the expected context, and note any missing or extraneous information that affects similarity.

Next, enhance context alignment by adjusting the provided context to better match the expected context, adding missing relevant details, and removing irrelevant content.

Finally, implement system adjustments to ensure context retrieval processes prioritise similarity with the expected context, refining context processing to better align with anticipated requirements.
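
As a sketch of that last step, the snippet below re-ranks candidate context passages by token-overlap (Jaccard) similarity to a reference context before they are passed to the model. The rerank_contexts helper, the candidate passages, and the reference text are all hypothetical and not part of the SDK.

def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two texts, from 0 to 1."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def rerank_contexts(candidates: list[str], reference_context: str, top_k: int = 3) -> list[str]:
    """Keep the top_k candidate passages most similar to the reference context."""
    ranked = sorted(candidates, key=lambda c: jaccard(c, reference_context), reverse=True)
    return ranked[:top_k]

# Hypothetical usage: retrieved chunks re-ranked before being sent to the model
candidates = [
    "The Earth orbits around the Sun in an elliptical path.",
    "Mars has two small moons, Phobos and Deimos.",
    "Kepler described planetary orbits as ellipses with the Sun at one focus.",
]
reference = "The Earth's orbit around the Sun is elliptical rather than circular."
print(rerank_contexts(candidates, reference, top_k=2))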


Differentiating Context Similarity from Similar Evals

  1. Context Relevance: Assesses whether the context is sufficient and appropriate for answering the query, while Context Similarity focuses on how closely the provided context matches the expected context.
  2. Context Adherence: Measures how well responses stay within the provided context, whereas Context Similarity evaluates the alignment between provided and expected context.