Evaluating RAG Applications
Evaluate RAG applications with Future AGI using context adherence, retrieval quality, answer correctness, and other retrieval-augmented generation metrics.
Retrieval-Augmented Generation Evaluation using Future AGI
Step 1 - Install the necessary packages and make the required imports
!pip install --ignore-installed blinker
!pip install futureagi datasets
import json
import requests
from fi.evals import Evaluator
from fi.evals import (
    ContextAdherence,
    ContextRetrieval,
    ContextSufficiency,
    RagasAnswerCorrectness,
    RagasCoherence,
    RagasHarmfulness,
)
from fi.testcases import TestCase, LLMTestCase
from datasets import load_dataset
Step 2 - Load the dataset and select a sample of rows
# Load the dataset
dataset = load_dataset("explodinggradients/ragas-wikiqa")
sample_data = dataset["train"]
df = sample_data.to_pandas()
df = df.head(10)
df.head()
| question | correct_answer | incorrect_answer | question_id | generated_with_rag | context | generated_without_rag |
|---|---|---|---|---|---|---|
| HOW AFRICAN AMERICANS WERE IMMIGRATED TO THE US | As such, African immigrants are to be distinguished… | From the Immigration and Nationality Act of 19… | Q0 | African Americans were immigrated to the United… | [African immigration to the United States refers… | African Americans were immigrated to the US in… |
| what are points on a mortgage | Points, sometimes also called a “discount point”… | Discount points may be different from originating… | Q1012 | Points on a mortgage are a form of pre-paid… | [Discount points, also called mortgage points… | A mortgage point is a fee equal to 1% of the l… |
| how does interlibrary loan work | The user makes a request with their local library… | Although books and journal articles are the most… | Q102 | Interlibrary loan works by allowing patrons… | [Interlibrary loan (abbreviated ILL, and sometimes… | Interlibrary loan is a service that allows lib… |
| WHAT IS A FY QUARTER | A fiscal year (or financial year, or sometimes… | Fiscal years vary between businesses and countries… | Q1027 | A FY quarter is a three-month period within… | [April.\n\n\n=== United States ===\n\n\n==== F… | A FY Quarter is a three-month period in the fi… |
| who wrote a rose is a rose is a rose | The sentence “Rose is a rose is a rose is a rose”… | I know that in daily life we don’t go around saying… | Q1032 | Gertrude Stein wrote the sentence “A rose is… | [The sentence “Rose is a rose is a rose is a rose”… | Gertrude Stein wrote “A Rose is a Rose is a Rose…” |
Step 3 - Choose the evaluations you want to perform
Available RAG evaluations in Future AGI:
Context Adherence
- Description: Ensures that responses remain within the provided context, avoiding information not present in the retrieved data.
- Key Points: Focuses on detecting hallucinations and ensuring factual consistency.
Context Relevance
- Description: Assesses how well the retrieved context aligns with the query.
- Key Points: Determines sufficiency of context to address the input.
Completeness
- Description: Evaluates whether the response fully answers the query.
- Key Points: Focuses on providing comprehensive and accurate answers.
Chunk Attribution
- Description: Tracks which context chunks are used in generating responses.
- Key Points: Highlights which parts of the context contribute to the response.
Chunk Utilization
- Description: Measures the effective usage of context chunks in generating responses.
- Key Points: Indicates the level of relevance and reliance on the provided context.
Context Similarity
- Description: Compares the provided context with expected context using similarity metrics.
- Key Points: Uses techniques like cosine similarity and Jaccard index for comparison.
Groundedness
- Description: Ensures that the response is strictly grounded in the provided context.
- Key Points: Verifies factual reliance on retrieved information.
Summarization Accuracy
- Description: Evaluates the accuracy of a summary against the original document.
- Key Points: Ensures faithfulness to the source material.
Eval Context Retrieval Quality
- Description: Assesses the quality and adequacy of the retrieved context.
- Key Points: Measures sufficiency and relevance of the retrieved information.
Eval Ranking
- Description: Provides ranking scores for contexts based on relevance and criteria.
- Key Points: Prioritizes contexts that best align with the query.
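The Context Similarity evaluation mentioned above relies on standard text-similarity metrics. As a rough illustration of what cosine similarity and the Jaccard index measure, here is a minimal bag-of-words sketch (the real evaluator's implementation may differ, e.g. it may use embeddings rather than raw token counts):

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity over bag-of-words term counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def jaccard_index(a: str, b: str) -> float:
    """Jaccard index over sets of unique tokens."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

print(round(cosine_similarity("a rose is a rose", "a rose is a flower"), 3))  # 0.882
print(jaccard_index("a rose is a rose", "a rose is a flower"))                # 0.75
```

Cosine similarity weights repeated terms, while the Jaccard index only considers token overlap, which is why the two scores differ on the same pair of strings.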
Step 4 - Create an object of the chosen evaluator(s)
# Create an object of the chosen evaluator(s)
#FutureAGI Metrics
context_adherence = ContextAdherence(config={"check_internet": False})
context_retrieval = ContextRetrieval(config={
    "check_internet": False,
    "criteria": "Does the retrieved context align with the input?"
})
context_sufficiency = ContextSufficiency(config={
    "check_internet": False,
    "model": "gpt-4o-mini"
})
metrics = {
"context_adherence": context_adherence,
"context_retrieval": context_retrieval,
"context_sufficiency": context_sufficiency,
}
Step 5 - Initialize the Evaluator and run the evaluations
# Initialize the Evaluator
evaluator = Evaluator(fi_api_key="your_api_key", fi_secret_key="your_secret_key", fi_base_url="https://api.futureagi.com")
for column in metrics:
    df[column] = None

for index, datapoint in df.iterrows():
    datapoint = datapoint.to_dict()
    test_case = TestCase(
        context=datapoint['context'],
        query=datapoint['question'],
        input=datapoint['question'],
        output=datapoint['generated_with_rag']
    )
    for metric in metrics:
        results = evaluator.evaluate(metrics[metric], test_case)
        df.at[index, metric] = results.eval_results[0]
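Hard-coding the API keys in the `Evaluator` call above is fine for a quick demo, but in practice they should come from the environment. A minimal sketch; the variable names `FI_API_KEY` and `FI_SECRET_KEY` are illustrative choices, not an SDK convention:

```python
import os

# Illustrative environment variable names; fall back to placeholders
# so the snippet runs even when the variables are unset.
fi_api_key = os.environ.get("FI_API_KEY", "your_api_key")
fi_secret_key = os.environ.get("FI_SECRET_KEY", "your_secret_key")
print(type(fi_api_key).__name__)  # str
```

This keeps credentials out of notebooks and version control; pass `fi_api_key` and `fi_secret_key` to `Evaluator(...)` as before.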
Step 6 - Aggregate the results
sum_context_adherence = 0
sum_context_retrieval = 0
sum_context_sufficiency = 0
for index, datapoint in df.iterrows():
    sum_context_adherence += datapoint['context_adherence'].metrics[0].value
    sum_context_retrieval += datapoint['context_retrieval'].metrics[0].value
    sum_context_sufficiency += datapoint['context_sufficiency'].metrics[0].value
print(f"Average Context Adherence: {sum_context_adherence/len(df)}")
print(f"Average Context Retrieval: {sum_context_retrieval/len(df)}")
print(f"Average Context Sufficiency: {sum_context_sufficiency/len(df)}")
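The three running sums above can be collapsed into a single dictionary comprehension that works for any set of metric columns. A self-contained sketch: the stub classes below merely mimic the `.metrics[0].value` access pattern used in the loop above, standing in for the real result objects returned by `evaluator.evaluate`:

```python
from dataclasses import dataclass

@dataclass
class Metric:          # stand-in for the SDK's per-metric entry
    value: float

@dataclass
class EvalResult:      # stand-in for the SDK's eval result object
    metrics: list

# Toy data with the same cell structure the evaluation loop produces
rows = {
    "context_adherence": [EvalResult([Metric(0.9)]), EvalResult([Metric(1.0)])],
    "context_retrieval": [EvalResult([Metric(0.8)]), EvalResult([Metric(1.0)])],
}

# Average the first metric value per column in one pass
averages = {
    name: sum(r.metrics[0].value for r in results) / len(results)
    for name, results in rows.items()
}
print(averages)
```

With the real DataFrame, the same comprehension over the keys of the `metrics` dict avoids adding a new pair of `sum_*` variables every time an evaluation is added.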
Average Context Adherence: 0.9399999999999998
Average Context Retrieval: 0.9
Average Context Sufficiency: 1.0