Retrieval-Augmented Generation Evaluation Using Future AGI

Step 1 - Install the necessary packages and import the required modules

!pip install --ignore-installed blinker
!pip install futureagi datasets
import json
import requests
from fi.evals import EvalClient

from fi.evals import (
    ContextAdherence,
    ContextRetrieval,
    ContextSufficiency,
    RagasAnswerCorrectness,
    RagasCoherence,
    RagasHarmfulness
)
from fi.testcases import TestCase, LLMTestCase

from datasets import load_dataset

Step 2 - Load the dataset and select a small sample of rows

# Load the dataset
dataset = load_dataset("explodinggradients/ragas-wikiqa")
sample_data = dataset["train"]
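# Convert to a pandas DataFrame and keep the first 10 rows so the evaluation run stays small.
# .copy() makes df an independent frame, so adding result columns later does not trigger pandas copy warnings.
df = sample_data.to_pandas().head(10).copy()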
df.head()
Output (first five rows):

| question | correct_answer | incorrect_answer | question_id | generated_with_rag | context | generated_without_rag |
| --- | --- | --- | --- | --- | --- | --- |
| HOW AFRICAN AMERICANS WERE IMMIGRATED TO THE US | As such, African immigrants are to be distinguished… | From the Immigration and Nationality Act of 19… | Q0 | African Americans were immigrated to the United… | [African immigration to the United States refers… | African Americans were immigrated to the US in… |
| what are points on a mortgage | Points, sometimes also called a “discount point”… | Discount points may be different from originating… | Q1012 | Points on a mortgage are a form of pre-paid… | [Discount points, also called mortgage points… | A mortgage point is a fee equal to 1% of the l… |
| how does interlibrary loan work | The user makes a request with their local library… | Although books and journal articles are the most… | Q102 | Interlibrary loan works by allowing patrons… | [Interlibrary loan (abbreviated ILL, and sometimes… | Interlibrary loan is a service that allows lib… |
| WHAT IS A FY QUARTER | A fiscal year (or financial year, or sometimes… | Fiscal years vary between businesses and countries… | Q1027 | A FY quarter is a three-month period within… | [April.\n\n\n=== United States ===\n\n\n==== F… | A FY Quarter is a three-month period in the fi… |
| who wrote a rose is a rose is a rose | The sentence “Rose is a rose is a rose is a rose”… | I know that in daily life we don’t go around saying… | Q1032 | Gertrude Stein wrote the sentence “A rose is… | [The sentence “Rose is a rose is a rose is a rose”… | Gertrude Stein wrote “A Rose is a Rose is a Rose…” |
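The context column in the preview above starts with "[", which suggests each cell holds a collection of passages and not a single string. If a later step needs one string per row, a small helper can flatten it. This is a minimal sketch: the name context_to_text and the blank-line separator are assumptions for illustration and are not part of Future AGI or the dataset.

```python
from collections.abc import Iterable


def context_to_text(context, sep="\n\n"):
    """Join a context cell into one string.

    Accepts a plain string (returned unchanged) or any iterable of
    passages, such as a list or a NumPy array. Hypothetical helper.
    """
    if isinstance(context, str):
        return context
    if isinstance(context, Iterable):
        return sep.join(str(passage) for passage in context)
    raise TypeError(f"Unsupported context type: {type(context).__name__}")


# Made-up passages standing in for one row of the context column.
passages = ["First retrieved passage.", "Second retrieved passage."]
print(context_to_text(passages))
```

With the DataFrame from this step, the helper could be applied as df["context"].apply(context_to_text), assuming the column really does hold lists or arrays of strings.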

Step 3 - Choose the evaluations you want to perform

Available RAG evaluations in Future AGI:

Context Adherence

  • Description: Ensures that responses remain within the provided context, avoiding information not present in the retrieved data.
  • Key Points: Focuses on detecting hallucinations and ensuring factual consistency.

Context Relevance

  • Description: Assesses how well the retrieved context aligns with the query.
  • Key Points: Determines sufficiency of context to address the input.

Completeness

  • Description: Evaluates whether the response fully answers the query.
  • Key Points: Focuses on providing comprehensive and accurate answers.

Chunk Attribution

  • Description: Tracks which context chunks are used in generating responses.
  • Key Points: Highlights which parts of the context contribute to the response.

Chunk Utilization

  • Description: Measures the effective usage of context chunks in generating responses.
  • Key Points: Indicates the level of relevance and reliance on the provided context.

Context Similarity

  • Description: Compares the provided context with expected context using similarity metrics.
  • Key Points: Uses techniques like cosine similarity and Jaccard index for comparison.

Groundedness

  • Description: Ensures that the response is strictly grounded in the provided context.
  • Key Points: Verifies factual reliance on retrieved information.

Summarization Accuracy

  • Description: Evaluates the accuracy of a summary against the original document.
  • Key Points: Ensures faithfulness to the source material.

Eval Context Retrieval Quality

  • Description: Assesses the quality and adequacy of the retrieved context.
  • Key Points: Measures sufficiency and relevance of the retrieved information.

Eval Ranking

  • Description: Provides ranking scores for contexts based on relevance and criteria.
  • Key Points: Prioritizes contexts that best align with the query.
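The Context Similarity entry above mentions cosine similarity and the Jaccard index. The sketch below shows what those two measures compute on simple word tokens. It is an illustration only, not Future AGI's implementation: the tokenizer and the function names are assumptions, and a production evaluator would more likely compare embeddings than raw word counts.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard_similarity(a, b):
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 1.0  # two empty texts are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_similarity(a, b):
    """Cosine of the angle between the two word-count vectors."""
    counts_a, counts_b = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no tokens on one side means no measurable overlap
    return dot / (norm_a * norm_b)


provided = "Interlibrary loan lets patrons borrow books from other libraries"
expected = "Interlibrary loan allows patrons to borrow books owned by other libraries"
print(jaccard_similarity(provided, expected))
print(cosine_similarity(provided, expected))
```

Jaccard ignores how often a word appears, while cosine similarity on counts rewards repeated shared terms, so the two scores can rank the same pair of contexts differently.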

Step 5 - Create an object of the chosen evaluator(s)

# Create an object of the chosen evaluator(s)
# Future AGI metrics

context_adherence = ContextAdherence(config={"check_internet": False})
context_retrieval = ContextRetrieval(config={
    "check_internet": False,
    "criteria": "Does the retrieved context align with the input?"
})
context_sufficiency = ContextSufficiency(config={
    "check_internet": False,
    "model": "gpt-4o-mini"
})

metrics = {
    "context_adherence": context_adherence,
    "context_retrieval": context_retrieval,
    "context_sufficiency": context_sufficiency,
}

Step 6 - Initialize the EvalClient and run evaluations

# Initialize the EvalClient (replace the placeholders with your own credentials)
evaluator = EvalClient(
    fi_api_key="your_api_key",
    fi_secret_key="your_secret_key",
    fi_base_url="https://api.futureagi.com",
)

# Add one empty column per metric to hold the evaluation results
for column in metrics:
    df[column] = None

# Build a test case for each row and run every selected evaluator on it
for index, datapoint in df.iterrows():
    datapoint = datapoint.to_dict()
    ragas_test_case = TestCase(
        context=datapoint['context'],
        query=datapoint['question'],
        input=datapoint['question'],
        output=datapoint['generated_with_rag']
    )
    for metric in metrics:
        results = evaluator.evaluate(metrics[metric], ragas_test_case)
        # Keep the first evaluation result returned for this metric
        df.at[index, metric] = results.eval_results[0]
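The client above is created with placeholder credentials typed into the code. One common alternative is to read them from environment variables so that real keys never end up in a shared notebook. The sketch below assumes the variable names FI_API_KEY and FI_SECRET_KEY and a hypothetical helper called load_credentials; neither is taken from this tutorial, so adjust them to your setup.

```python
import os


def load_credentials(env=os.environ):
    """Read the API and secret keys from a mapping of environment variables.

    The variable names are assumptions for this sketch. Raises RuntimeError
    listing every variable that is missing or empty.
    """
    api_key = env.get("FI_API_KEY")
    secret_key = env.get("FI_SECRET_KEY")
    missing = [
        name
        for name, value in (("FI_API_KEY", api_key), ("FI_SECRET_KEY", secret_key))
        if not value
    ]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {"fi_api_key": api_key, "fi_secret_key": secret_key}


# Demonstration with a plain dict standing in for os.environ (values are fake).
creds = load_credentials({"FI_API_KEY": "demo-key", "FI_SECRET_KEY": "demo-secret"})
print(sorted(creds))
```

The returned dictionary uses the same keyword names as the EvalClient call in this step, so it could be passed as EvalClient(**load_credentials(), fi_base_url="https://api.futureagi.com").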

Step 7 - Aggregate the results

sum_context_adherence = 0
sum_context_retrieval = 0
sum_context_sufficiency = 0

for index, datapoint in df.iterrows():
    sum_context_adherence += datapoint['context_adherence'].metrics[0].value
    sum_context_retrieval += datapoint['context_retrieval'].metrics[0].value
    sum_context_sufficiency += datapoint['context_sufficiency'].metrics[0].value

print(f"Average Context Adherence: {sum_context_adherence/len(df)}")
print(f"Average Context Retrieval: {sum_context_retrieval/len(df)}")
print(f"Average Context Sufficiency: {sum_context_sufficiency/len(df)}")
Output:

Average Context Adherence: 0.9399999999999998
Average Context Retrieval: 0.9
Average Context Sufficiency: 1.0
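The aggregation above keeps one running sum per metric, so adding a metric means adding more variables and print lines. A more general version can loop over the metric names. The sketch below assumes each stored result exposes its score as result.metrics[0].value, as in the loop above; the helper names and the SimpleNamespace stand-ins are hypothetical and only imitate that shape so the example runs without calling the API.

```python
from statistics import mean
from types import SimpleNamespace


def metric_value(result):
    """Return the first metric value of a result shaped like result.metrics[0].value."""
    return result.metrics[0].value


def average_scores(rows, metric_names):
    """Average each metric over the rows, skipping rows with no stored result."""
    averages = {}
    for name in metric_names:
        values = [metric_value(row[name]) for row in rows if row.get(name) is not None]
        averages[name] = mean(values) if values else None
    return averages


def fake_result(value):
    """Build a stand-in object with the same attribute layout as an evaluation result."""
    return SimpleNamespace(metrics=[SimpleNamespace(value=value)])


# Two made-up rows; the second has no result for context_retrieval.
rows = [
    {"context_adherence": fake_result(1.0), "context_retrieval": fake_result(0.8)},
    {"context_adherence": fake_result(0.5), "context_retrieval": None},
]
for name, score in average_scores(rows, ["context_adherence", "context_retrieval"]).items():
    print(f"Average {name}: {score}")
```

With the DataFrame from this tutorial, the same function could be called as average_scores(df.to_dict("records"), list(metrics)). Skipping missing results is a design choice of this sketch: it averages over the rows that were scored instead of dividing by len(df).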