Evaluation Using Interface

Input:

  • Required Inputs:
    • output: column containing conversation history between the user and the model

Output:

  • Score: percentage score between 0 and 100

Interpretation:

  • Higher scores: Indicate that the conversation is resolved.
  • Lower scores: Suggest that the conversation is not resolved.

Evaluation Using Python SDK

Click here to learn how to set up evaluation using the Python SDK.

Input:

  • Required Inputs:
    • messages: list - conversation history between the user and the model, provided as query and response pairs (a list of LLMTestCase objects in the example below)

Output:

  • Score: float - score between 0 and 1

Interpretation:

  • Higher scores: Indicate that the conversation is resolved.
  • Lower scores: Suggest that the conversation is not resolved.
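
The following example shows how to run the Conversation Resolution eval with the Python SDK:
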
from fi.evals.templates import ConversationResolution
from fi.testcases import ConversationalTestCase, LLMTestCase

test_case = ConversationalTestCase(
    messages=[
        LLMTestCase(
            query="I need help with my math homework",
            response="I'd be happy to help! What specific math problem are you working on?"
        ),
        LLMTestCase(
            query="I'm trying to solve this equation: 2x + 5 = 13",
            response="Let's solve this step by step:\\n1. Subtract 5 from both sides: 2x = 8\\n2. Divide both sides by 2: x = 4\\nThe solution is x = 4. Would you like me to verify this answer with you?"
        ),
        LLMTestCase(
            query="Yes, please check if it's correct",
            response="Let's verify:\\nIf x = 4, then:\\n2(4) + 5 = 8 + 5 = 13\\nThe equation balances, so x = 4 is correct! Is there anything else you'd like help with?"
        ),
        LLMTestCase(
            query="No, that's all. Thank you!",
            response="You're welcome! Good luck with your math studies. Feel free to ask if you need help with any other problems."
        )
    ]
)

# Instantiate the Conversation Resolution evaluation template
template = ConversationResolution()

# `evaluator` is assumed to be an evaluator client initialized as described in the SDK setup guide linked above
response = evaluator.evaluate(eval_templates=[template], inputs=[test_case])

# The resolution score is a float between 0 and 1
score = response.eval_results[0].metrics[0].value
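
The snippet below is one illustrative way to act on the returned score; the 0.5 threshold is an example choice, not a value defined by the SDK.

print(f"Conversation resolution score: {score:.2f}")
if score < 0.5:  # illustrative threshold, not part of the SDK
    print("Conversation may be unresolved; review the final turns.")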


What to do when Conversation Resolution is Low

  • Add confirmation mechanisms to verify user satisfaction
  • Develop fallback responses for unclear or complex queries
  • Track common patterns in unresolved queries for improvement
  • Consider implementing a clarification system for ambiguous requests (a minimal sketch follows this list)
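
The sketch below is a minimal, hedged illustration of the confirmation and clarification ideas above. It is not part of the fi SDK; names such as needs_clarification and with_confirmation are purely illustrative.

# Illustrative helpers (not part of the fi SDK)
AMBIGUOUS_MARKERS = ("something", "stuff", "this thing", "doesn't work")

def needs_clarification(query: str) -> bool:
    # Crude heuristic: very short or vague queries likely warrant a follow-up question
    q = query.strip().lower()
    return len(q.split()) < 4 or any(marker in q for marker in AMBIGUOUS_MARKERS)

def with_confirmation(response: str) -> str:
    # Ensure the reply ends by checking whether the user's need was resolved
    if response.rstrip().endswith("?"):
        return response
    return response.rstrip() + " Does that fully answer your question?"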

Comparing Conversation Resolution with Similar Evals

  1. Conversation Coherence: While Resolution focuses on addressing user needs, Coherence evaluates the logical flow and context maintenance. A conversation can be perfectly coherent but fail to resolve user queries, or vice versa.
  2. Completeness: Resolution differs from Completeness as it focuses on satisfactory conclusion rather than comprehensive coverage. A response can be complete but not resolve the user’s actual need.
  3. Context Relevance: Resolution evaluates whether queries are answered, while Context Relevance assesses if the provided context is sufficient for generating responses. A response can use relevant context but still fail to resolve the user’s query.