• Taking notes during a meeting can be challenging, because you have to juggle active listening with documenting the discussion.
  • There are plenty of summarization tools on the market, but evaluating them quantitatively is the real challenge.
  • This cookbook will guide you through evaluating meeting summaries generated from transcripts using Future AGI.
  • The dataset used here consists of transcripts of 1,366 meetings from the city councils of 6 major U.S. cities: Paper | Hugging Face

1. Loading Dataset

Loading a dataset into the Future AGI platform is easy. You can either upload it directly as JSON or CSV, or import it from Hugging Face. Follow the detailed steps on how to add a dataset to Future AGI in the docs.
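If you would rather pull the transcripts locally before uploading, a minimal sketch using the Hugging Face datasets library is shown below. The dataset ID is an assumption, so substitute the ID of the dataset you actually use.

from datasets import load_dataset

# Sketch only: the dataset ID below is an assumption; replace it with the
# Hugging Face ID of the meeting-transcript dataset you are working with.
transcripts = load_dataset("huuuyeah/meetingbank", split="train")
transcripts.to_pandas().to_csv("meeting-transcripts.csv", index=False)  # upload this CSV to Future AGI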

2. Creating Summary

After successfully loading the dataset, you can see it in the dashboard. Now, click on Run Prompt in the top-right corner and create a prompt to generate the summary.

After a summary has been created for each row, download the dataset using the download button in the top-right corner.
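If you prefer to create a summary column in code rather than through the dashboard, a rough sketch with the OpenAI SDK could look like the following. The prompt wording, model name, and column names are placeholders, not the exact prompt used in this cookbook.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def summarize(transcript: str) -> str:
    # Placeholder prompt; adjust it to match the prompt you run in the dashboard
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Summarize this city council meeting transcript."},
            {"role": "user", "content": transcript},
        ],
    )
    return response.choices[0].message.content

# dataset["summary-gpt-4o"] = dataset["source"].apply(summarize)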

3. Installing

!pip install futureagi==0.3.5

4. Initialising Client

import os
from fi.evals import EvalClient
from google.colab import userdata  # this notebook assumes Colab's secret store

# Pull the Future AGI credentials from Colab secrets into environment variables
os.environ["FI_API_KEY"] = userdata.get("fi_api_key")
os.environ["FI_SECRET_KEY"] = userdata.get("fi_secret_key")

evaluator = EvalClient(
    fi_api_key=os.getenv("FI_API_KEY"),
    fi_secret_key=os.getenv("FI_SECRET_KEY"),
    fi_base_url="https://api.futureagi.com"
)
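The snippet above relies on Google Colab's secret store (userdata). Outside Colab, one option is to export FI_API_KEY and FI_SECRET_KEY in your shell and read them directly:

import os
from fi.evals import EvalClient

# Assumes FI_API_KEY and FI_SECRET_KEY are already set as environment variables
evaluator = EvalClient(
    fi_api_key=os.environ["FI_API_KEY"],
    fi_secret_key=os.environ["FI_SECRET_KEY"],
    fi_base_url="https://api.futureagi.com"
)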

5. Importing Dataset

import pandas as pd

# Load the CSV exported from the Future AGI dashboard in step 2
dataset = pd.read_csv("meeting-summary.csv", encoding='utf-8', on_bad_lines='skip')
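A quick sanity check confirms the columns the rest of the notebook expects: a "source" transcript column, a "reference" summary column, and one generated-summary column per model.

# Confirm the expected columns are present before running the evaluations
print(dataset.shape)
print(dataset.columns.tolist())
dataset.head(2)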

6. Evaluation

a. Using Future AGI’s Summary Quality Metric

Summary Quality: Evaluates whether a summary effectively captures the main points, maintains factual accuracy, and achieves an appropriate length while preserving the original meaning. It checks both for inclusion of key information and for exclusion of unnecessary details.

from fi.testcases import TestCase
from fi.evals.templates import SummaryQuality

def evaluate_summary_quality(dataset, summary_column_name):
    scores = []
    template = SummaryQuality(config={"check_internet": False})

    for _, row in dataset.iterrows():
        test_case = TestCase(
            input=row["source"],              # full meeting transcript
            output=row[summary_column_name],  # model-generated summary
            context=row["reference"]          # reference summary from the dataset
        )
        response = evaluator.evaluate(eval_templates=[template], inputs=[test_case])

        score = response.eval_results[0].metrics[0].value
        scores.append(score)

    average_score = sum(scores) / len(scores) if scores else 0

    # combined_results is the module-level list defined in the Result section
    combined_results.append({
        "Summary Column": summary_column_name,
        "Avg. Summary Quality": average_score
    })
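Since the function makes one evaluation call per row and appends its average to the module-level combined_results list, you may want to try it on a small slice first. The snippet below is just such a trial run.

# Trial run on a few rows; combined_results must exist before the call
combined_results = []
evaluate_summary_quality(dataset.head(5), "summary-gpt-4o")
print(combined_results)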

b. Using BERT Score

Compares the generated response and a reference text using contextual embeddings from pre-trained language models such as bert-base-uncased. It calculates precision, recall, and F1 at the token level, based on cosine similarity between the embeddings of each token in the generated response and in the reference text.
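To make the mechanics concrete, here is a toy sketch of the greedy matching behind BERTScore (without idf weighting): each candidate token is matched to its most similar reference token for precision, each reference token to its most similar candidate token for recall, and F1 is their harmonic mean. The similarity matrix below is made up for illustration; in practice it comes from contextual BERT embeddings.

import numpy as np

# Made-up cosine similarities between 3 candidate tokens (rows)
# and 4 reference tokens (columns)
sim = np.array([
    [0.9, 0.1, 0.2, 0.3],
    [0.2, 0.8, 0.1, 0.4],
    [0.3, 0.2, 0.7, 0.1],
])

precision = sim.max(axis=1).mean()  # best reference match per candidate token
recall = sim.max(axis=0).mean()     # best candidate match per reference token
f1 = 2 * precision * recall / (precision + recall)
print(round(precision, 2), round(recall, 2), round(f1, 2))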

!pip install bert_score
from bert_score import score

def evaluate_bertscore(dataset, summary_column_name):

    temp_results = []
    for _, row in dataset.iterrows():
        source = row["source"]
        summary = row[summary_column_name]

        # Score the summary against the full transcript, one row at a time
        P, R, F1 = score([summary], [source], model_type="bert-base-uncased", lang="en", verbose=False)

        temp_results.append({
            "bert_precision": P.mean().item(),
            "bert_recall": R.mean().item(),
            "bert_f1": F1.mean().item()
        })

    results_df = pd.DataFrame(temp_results)
    average_p = results_df["bert_precision"].mean()
    average_r = results_df["bert_recall"].mean()
    average_f1 = results_df["bert_f1"].mean()

    # Attach the averages to the entry that evaluate_summary_quality just
    # appended for this column, so each row of combined_results covers one model
    combined_results[-1].update({
        "Avg. Precision": average_p,
        "Avg. Recall": average_r,
        "Avg. F1": average_f1
    })
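Because score() accepts parallel lists of candidates and references, a batched variant that scores a whole column in one call avoids reloading the model for every row. A sketch under that assumption:

def evaluate_bertscore_batched(dataset, summary_column_name):
    # One batched call per summary column instead of one call per row
    candidates = dataset[summary_column_name].astype(str).tolist()
    references = dataset["source"].astype(str).tolist()
    P, R, F1 = score(candidates, references, model_type="bert-base-uncased", lang="en", verbose=False)

    combined_results[-1].update({
        "Avg. Precision": P.mean().item(),
        "Avg. Recall": R.mean().item(),
        "Avg. F1": F1.mean().item()
    })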

Result

combined_results = []
summary_columns = ["summary-gpt-4o", "summary-gpt-4o-mini", "summary-claude3.5-sonnet"]

for column in summary_columns:
    print(f"Evaluating Summary Quality for {column}...")
    evaluate_summary_quality(dataset, column)

    print(f"Evaluating BERTScore for {column}...")
    evaluate_bertscore(dataset, column)
    print()

Evaluating Summary Quality for summary-gpt-4o...
Evaluating BERTScore for summary-gpt-4o...

Evaluating Summary Quality for summary-gpt-4o-mini...
Evaluating BERTScore for summary-gpt-4o-mini...

Evaluating Summary Quality for summary-claude3.5-sonnet...
Evaluating BERTScore for summary-claude3.5-sonnet...

from tabulate import tabulate

combined_results_df = pd.DataFrame(combined_results)

for col in ["Avg. Summary Quality", "Avg. Precision", "Avg. Recall", "Avg. F1"]:
    if col in combined_results_df.columns:
        combined_results_df[col] = combined_results_df[col].apply(lambda x: f"{x:.2f}")
    else:
        print(f"Warning: Column {col} not found in the dataframe")

print(tabulate(
    combined_results_df,
    headers='keys',
    tablefmt='fancy_grid',
    showindex=False,
    colalign=("left", "center", "center", "center", "center")
))
Summary Column              Avg. Summary Quality    Avg. Precision    Avg. Recall    Avg. F1
summary-gpt-4o              0.64                    0.63              0.36           0.46
summary-gpt-4o-mini         0.56                    0.63              0.36           0.45
summary-claude3.5-sonnet    0.68                    0.62              0.36           0.46