Evaluation Using Interface

Input:

  • Required Inputs:
    • output: column containing conversation history between the user and the model

Output:

  • Score: percentage score between 0 and 100

Interpretation:

  • Higher scores: Indicate that the conversation is more coherent.
  • Lower scores: Suggest that the conversation is less coherent.

Evaluation Using Python SDK

Click here to learn how to set up evaluation using the Python SDK.

Input:

  • Required Inputs:
    • messages: list[LLMTestCase] - conversation history between the user and the model, provided as query and response pairs

Output:

  • Score: float - returns score between 0 and 1

Interpretation:

  • Higher scores: Indicate that the conversation is more coherent.
  • Lower scores: Suggest that the conversation is less coherent.

Example:

from fi.evals.templates import ConversationCoherence
from fi.testcases import ConversationalTestCase, LLMTestCase

# Build the conversation as an ordered list of query/response turns.
test_case = ConversationalTestCase(
    messages=[
        LLMTestCase(
            query="Hi, how are you?",
            response="I'm doing well, thank you! How can I help you today?"
        ),
        LLMTestCase(
            query="I need help with my homework",
            response="I'd be happy to help. What subject are you working on?"
        )
    ]
)

# Select the Conversation Coherence template.
template = ConversationCoherence()

# `evaluator` must already be defined: an evaluation client configured as
# described in the Python SDK setup guide referenced above.
response = evaluator.evaluate(eval_templates=[template], inputs=[test_case])

# Coherence score as a float between 0 and 1.
score = response.eval_results[0].metrics[0].value
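
Because the interface reports a percentage between 0 and 100 while the SDK returns a float between 0 and 1, it can help to put both on one scale before comparing them or applying a pass/fail rule. The following is a minimal sketch of that step. The helper names `to_percentage` and `is_coherent` and the 0.7 cut-off are illustrative assumptions, not part of the SDK.

```python
def to_percentage(sdk_score: float) -> float:
    """Convert an SDK score in [0, 1] to the 0-100 scale shown in the interface."""
    if not 0.0 <= sdk_score <= 1.0:
        raise ValueError(f"SDK score must be between 0 and 1, got {sdk_score}")
    return sdk_score * 100.0


def is_coherent(sdk_score: float, threshold: float = 0.7) -> bool:
    """Return True when the score meets the threshold.

    The default of 0.7 is an arbitrary example value; choose a threshold
    that suits your own data.
    """
    return sdk_score >= threshold
```

With the `score` variable from the example above, `is_coherent(score)` gives a simple gate for flagging conversations that need review.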


What to do when Conversation Coherence is Low

  • Review the conversation history to identify the turns where context breaks occurred
  • Implement context window management so that important information from earlier turns is retained
  • If context loss persists, consider shortening conversation threads
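
The first two steps above can be sketched in code. The block below is an illustration under stated assumptions, not an SDK feature: `first_coherence_drop` scores growing prefixes of the conversation to find the turn at which coherence first falls below a threshold, and `trim_history` keeps the most recent turns plus any turns marked as important. Both function names, the `(query, response)` tuple representation, and `score_fn` (a hypothetical callable that would wrap your own evaluator call and return a score between 0 and 1) are assumptions made for this example.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Turn = Tuple[str, str]  # (query, response)


def first_coherence_drop(
    turns: List[Turn],
    score_fn: Callable[[List[Turn]], float],
    threshold: float = 0.7,
) -> Optional[int]:
    """Return the index of the first turn whose addition pulls the score
    below `threshold`, or None if every prefix stays at or above it.

    `score_fn` is a hypothetical callable supplied by the caller; it is
    invoked once per prefix, so this costs one evaluation per turn.
    """
    for end in range(1, len(turns) + 1):
        if score_fn(turns[:end]) < threshold:
            return end - 1
    return None


def trim_history(
    turns: List[Turn],
    max_turns: int,
    pinned: Iterable[int] = (),
) -> List[Turn]:
    """Keep pinned turns plus the most recent turns, in original order.

    Pinned turns are always kept, even if there are more of them than
    `max_turns`; the remaining budget is filled from the newest turn back.
    """
    if max_turns < 0:
        raise ValueError("max_turns must be non-negative")
    keep = {i for i in pinned if 0 <= i < len(turns)}
    for i in range(len(turns) - 1, -1, -1):
        if len(keep) >= max_turns:
            break
        keep.add(i)
    return [turns[i] for i in sorted(keep)]
```

Pinning by index is only one option. Any rule that identifies turns carrying key facts (for example, the turn where the user states their goal) could produce the `pinned` set.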

Comparing Conversation Coherence with Similar Evals

  1. Conversation Resolution: Coherence focuses on flow and context maintenance throughout the conversation, while Resolution evaluates whether the conversation reaches a satisfactory conclusion.
  2. Context Adherence: Coherence evaluates the internal consistency of the conversation, while Context Adherence evaluates adherence to external context.
  3. Completeness: Coherence focuses on the logical flow between messages, while Completeness evaluates whether individual responses fully address their queries.