AI Evaluation for Meeting Summarization
- Taking notes during a meeting is challenging: you have to split your attention between active listening and documenting.
- Plenty of summarization tools are on the market, but evaluating them quantitatively is the hard part.
- This cookbook will guide you through evaluating meeting summaries generated from transcripts, using Future AGI.
- The dataset used here comprises transcripts of 1,366 meetings from the city councils of 6 major U.S. cities (Paper | Hugging Face).
1. Loading Dataset
Loading a dataset into the Future AGI platform is easy: either upload it directly as JSON or CSV, or import it from Hugging Face. Detailed steps for adding a dataset to Future AGI are in the docs.
2. Creating Summary
After the dataset loads successfully, it appears in the dashboard. Click Run Prompt in the top-right corner and create a prompt that generates a summary.
After a summary has been created for each row, download the dataset using the download button in the top-right corner.
3. Installing
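The evaluation steps below assume the Future AGI Python SDK and a BERTScore implementation are installed. The package names here are assumptions; check the Future AGI docs for the current install command.

```shell
# Package names are assumptions -- verify against the Future AGI docs.
pip install futureagi bert-score
```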
4. Initialising Client
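A minimal sketch of client setup. The environment-variable names and the commented-out import path are assumptions modeled on common SDK conventions; substitute the credential names and import from the Future AGI docs.

```python
import os

# Credential variable names below are assumptions -- use the ones from the docs.
os.environ["FI_API_KEY"] = "your-api-key"
os.environ["FI_SECRET_KEY"] = "your-secret-key"

# Assumed import path; uncomment once the SDK is installed:
# from fi.evals import EvalClient
# evaluator = EvalClient(
#     fi_api_key=os.environ["FI_API_KEY"],
#     fi_secret_key=os.environ["FI_SECRET_KEY"],
# )
```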
5. Import Dataset
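Once the dataset with generated summaries has been downloaded as CSV, it can be read back for evaluation. The column names below (`transcript`, `summary-gpt-4o`) are assumptions matching the result table, and the two-row file is synthesized in memory purely for illustration; in practice you would open the file you downloaded.

```python
import csv
import io

# Synthetic stand-in for the downloaded CSV; in practice, open the real file:
# with open("your_downloaded_dataset.csv", newline="") as f: ...
raw = io.StringIO(
    "transcript,summary-gpt-4o\n"
    "Full council transcript...,Council approved the budget.\n"
    "Another transcript...,Zoning vote was deferred.\n"
)
rows = list(csv.DictReader(raw))
print(len(rows))  # 2
```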
6. Evaluation
a. Using Future AGI’s Summary Quality Metric
Summary Quality: Evaluates whether a summary effectively captures the main points, maintains factual accuracy, and achieves an appropriate length while preserving the original meaning. It checks both for inclusion of key information and exclusion of unnecessary detail.
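Future AGI's Summary Quality metric is model-graded; the snippet below is not that metric, only a naive stand-in illustrating one dimension it checks, inclusion of key information. The function name and example data are hypothetical.

```python
def keypoint_coverage(summary: str, key_points: list[str]) -> float:
    """Fraction of expected key points mentioned in the summary (naive substring check)."""
    text = summary.lower()
    hits = sum(1 for kp in key_points if kp.lower() in text)
    return hits / len(key_points)

# Hypothetical key points for one council meeting.
key_points = ["budget approved", "zoning vote deferred"]
summary = "The council confirmed the budget approved last week; the zoning vote deferred again."
print(keypoint_coverage(summary, key_points))  # 1.0
```

A real model-graded metric also penalizes hallucinated content and excessive length, which a coverage check alone cannot see.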
b. Using BERT Score
Compares the generated response with a reference text using contextual embeddings from pre-trained language models such as bert-base-uncased. It computes precision, recall, and F1 at the token level, based on cosine similarity between the embeddings of each token in the generated response and those in the reference text.
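In practice the bert-score package handles this end to end; to make the token-level mechanics concrete, here is a self-contained sketch using toy vectors in place of BERT's contextual embeddings:

```python
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def bertscore(cand, ref):
    """Greedy matching: each token pairs with its most similar counterpart.
    Precision averages over candidate tokens, recall over reference tokens."""
    precision = sum(max(cosine(c, r) for r in ref) for c in cand) / len(cand)
    recall = sum(max(cosine(c, r) for c in cand) for r in ref) / len(ref)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy per-token "embeddings" (real usage: vectors from bert-base-uncased).
cand = [[1.0, 0.0], [0.7, 0.7]]
ref = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
p, r, f1 = bertscore(cand, ref)
```

Note the asymmetry visible even in this toy case: a short candidate can score high precision while missing reference tokens drags recall down, which matches the low Avg. Recall values in the result table below.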
Result
| Summary Column | Avg. Summary Quality | Avg. Precision | Avg. Recall | Avg. F1 |
|---|---|---|---|---|
| summary-gpt-4o | 0.64 | 0.63 | 0.36 | 0.46 |
| summary-gpt-4o-mini | 0.56 | 0.63 | 0.36 | 0.45 |
| summary-claude3.5-sonnet | 0.68 | 0.62 | 0.36 | 0.46 |
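Each table value is the mean of the per-row scores for that summary column. The aggregation itself is just an average over evaluated rows; the scores below are illustrative, not taken from the actual run.

```python
from statistics import mean

# Hypothetical per-row Summary Quality scores for one summary column.
per_row_scores = {"summary-gpt-4o": [0.70, 0.60, 0.62]}

averages = {col: round(mean(scores), 2) for col, scores in per_row_scores.items()}
print(averages)  # {'summary-gpt-4o': 0.64}
```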