Evaluating RAG Applications
Evaluate RAG applications with Future AGI using context adherence, retrieval quality, answer correctness, and other retrieval-augmented generation metrics.
Retrieval-Augmented Generation Evaluation using Future AGI
Step 1 - Install the necessary packages and make the required imports
!pip install --ignore-installed blinker
!pip install futureagi datasets
import json
import requests
from fi.evals import Evaluator
from fi.evals import (
    ContextAdherence,
    ContextRetrieval,
    ContextSufficiency,
    RagasAnswerCorrectness,
    RagasCoherence,
    RagasHarmfulness,
)
from fi.testcases import TestCase, LLMTestCase
from datasets import load_dataset
Step 2 - Load the dataset and select a sample of rows
# Load the dataset
dataset = load_dataset("explodinggradients/ragas-wikiqa")
sample_data = dataset["train"]
df = sample_data.to_pandas()
df = df.head(10)
df.head()
| question | correct_answer | incorrect_answer | question_id | generated_with_rag | context | generated_without_rag |
|---|---|---|---|---|---|---|
| HOW AFRICAN AMERICANS WERE IMMIGRATED TO THE US | As such, African immigrants are to be distinguished… | From the Immigration and Nationality Act of 19… | Q0 | African Americans were immigrated to the United… | [African immigration to the United States refers… | African Americans were immigrated to the US in… |
| what are points on a mortgage | Points, sometimes also called a “discount point”… | Discount points may be different from originating… | Q1012 | Points on a mortgage are a form of pre-paid… | [Discount points, also called mortgage points… | A mortgage point is a fee equal to 1% of the l… |
| how does interlibrary loan work | The user makes a request with their local library… | Although books and journal articles are the most… | Q102 | Interlibrary loan works by allowing patrons… | [Interlibrary loan (abbreviated ILL, and sometimes… | Interlibrary loan is a service that allows lib… |
| WHAT IS A FY QUARTER | A fiscal year (or financial year, or sometimes… | Fiscal years vary between businesses and countries… | Q1027 | A FY quarter is a three-month period within… | [April.\n\n\n=== United States ===\n\n\n==== F… | A FY Quarter is a three-month period in the fi… |
| who wrote a rose is a rose is a rose | The sentence “Rose is a rose is a rose is a rose”… | I know that in daily life we don’t go around saying… | Q1032 | Gertrude Stein wrote the sentence “A rose is… | [The sentence “Rose is a rose is a rose is a rose”… | Gertrude Stein wrote “A Rose is a Rose is a Rose…” |
Step 3 - Choose the evaluations you want to perform
Available RAG evaluations in Future AGI:
Context Adherence
- Description: Ensures that responses remain within the provided context, avoiding information not present in the retrieved data.
- Key Points: Focuses on detecting hallucinations and ensuring factual consistency.
Context Relevance
- Description: Assesses how well the retrieved context aligns with the query.
- Key Points: Determines sufficiency of context to address the input.
Completeness
- Description: Evaluates whether the response fully answers the query.
- Key Points: Focuses on providing comprehensive and accurate answers.
Chunk Attribution
- Description: Tracks which context chunks are used in generating responses.
- Key Points: Highlights which parts of the context contribute to the response.
Chunk Utilization
- Description: Measures the effective usage of context chunks in generating responses.
- Key Points: Indicates the level of relevance and reliance on the provided context.
Context Similarity
- Description: Compares the provided context with expected context using similarity metrics.
- Key Points: Uses techniques like cosine similarity and Jaccard index for comparison.
Groundedness
- Description: Ensures that the response is strictly grounded in the provided context.
- Key Points: Verifies factual reliance on retrieved information.
Summarization Accuracy
- Description: Evaluates the accuracy of a summary against the original document.
- Key Points: Ensures faithfulness to the source material.
Eval Context Retrieval Quality
- Description: Assesses the quality and adequacy of the retrieved context.
- Key Points: Measures sufficiency and relevance of the retrieved information.
Eval Ranking
- Description: Provides ranking scores for contexts based on relevance and criteria.
- Key Points: Prioritizes contexts that best align with the query.
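The Context Similarity evaluation mentioned above relies on standard text-similarity metrics. As a rough illustration of what cosine similarity and the Jaccard index measure, here is a minimal bag-of-words sketch (the real evaluator's implementation may differ, e.g. it may use embeddings rather than raw token counts):

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity over bag-of-words term counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def jaccard_index(a: str, b: str) -> float:
    """Jaccard index over sets of unique tokens."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

print(round(cosine_similarity("a rose is a rose", "a rose is a flower"), 3))  # 0.882
print(jaccard_index("a rose is a rose", "a rose is a flower"))                # 0.75
```

Cosine similarity weights repeated terms, while the Jaccard index only considers token overlap, which is why the two scores differ on the same pair of strings.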
Step 4 - Create an object of the chosen evaluator(s)
# Create an object of the chosen evaluator(s)
#FutureAGI Metrics
context_adherence = ContextAdherence(config={"check_internet": False})
context_retrieval = ContextRetrieval(config={
    "check_internet": False,
    "criteria": "Does the retrieved context align with the input?"
})
context_sufficiency = ContextSufficiency(config={
    "check_internet": False,
    "model": "gpt-4o-mini"
})
metrics = {
"context_adherence": context_adherence,
"context_retrieval": context_retrieval,
"context_sufficiency": context_sufficiency,
}
Step 5 - Initialize the Evaluator and run the evaluations
# Initialize the Evaluator
evaluator = Evaluator(fi_api_key="your_api_key", fi_secret_key="your_secret_key", fi_base_url="https://api.futureagi.com")
for column in metrics:
    df[column] = None

for index, datapoint in df.iterrows():
    datapoint = datapoint.to_dict()
    test_case = TestCase(
        context=datapoint['context'],
        query=datapoint['question'],
        input=datapoint['question'],
        output=datapoint['generated_with_rag']
    )
    for metric in metrics:
        results = evaluator.evaluate(metrics[metric], test_case)
        df.at[index, metric] = results.eval_results[0]
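Hard-coding the API keys in the `Evaluator` call above is fine for a quick demo, but in practice they should come from the environment. A minimal sketch; the variable names `FI_API_KEY` and `FI_SECRET_KEY` are illustrative choices, not an SDK convention:

```python
import os

# Illustrative environment variable names; fall back to placeholders
# so the snippet runs even when the variables are unset.
fi_api_key = os.environ.get("FI_API_KEY", "your_api_key")
fi_secret_key = os.environ.get("FI_SECRET_KEY", "your_secret_key")
print(type(fi_api_key).__name__)  # str
```

This keeps credentials out of notebooks and version control; pass `fi_api_key` and `fi_secret_key` to `Evaluator(...)` as before.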
Step 6 - Aggregate the results
sum_context_adherence = 0
sum_context_retrieval = 0
sum_context_sufficiency = 0
for index, datapoint in df.iterrows():
    sum_context_adherence += datapoint['context_adherence'].metrics[0].value
    sum_context_retrieval += datapoint['context_retrieval'].metrics[0].value
    sum_context_sufficiency += datapoint['context_sufficiency'].metrics[0].value
print(f"Average Context Adherence: {sum_context_adherence/len(df)}")
print(f"Average Context Retrieval: {sum_context_retrieval/len(df)}")
print(f"Average Context Sufficiency: {sum_context_sufficiency/len(df)}")
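The three running sums above can be collapsed into a single dictionary comprehension that works for any set of metric columns. A self-contained sketch: the stub classes below merely mimic the `.metrics[0].value` access pattern used in the loop above, standing in for the real result objects returned by `evaluator.evaluate`:

```python
from dataclasses import dataclass

@dataclass
class Metric:          # stand-in for the SDK's per-metric entry
    value: float

@dataclass
class EvalResult:      # stand-in for the SDK's eval result object
    metrics: list

# Toy data with the same cell structure the evaluation loop produces
rows = {
    "context_adherence": [EvalResult([Metric(0.9)]), EvalResult([Metric(1.0)])],
    "context_retrieval": [EvalResult([Metric(0.8)]), EvalResult([Metric(1.0)])],
}

# Average the first metric value per column in one pass
averages = {
    name: sum(r.metrics[0].value for r in results) / len(results)
    for name, results in rows.items()
}
print(averages)
```

With the real DataFrame, the same comprehension over the keys of the `metrics` dict avoids adding a new pair of `sum_*` variables every time an evaluation is added.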
Average Context Adherence: 0.9399999999999998
Average Context Retrieval: 0.9
Average Context Sufficiency: 1.0