Choosing ML Models
This section walks through a specific approach to selecting, running and evaluating the results of ML models. We're starting with a small language model to illustrate this building block. We highly recommend following along inside Colab so you can run the examples. All examples will run within the constraints of Colab's free T4 GPU.
First Steps With Language Models
How do I pick a model?
Unlike other guides, this one is designed to help pick the right model for whatever task you're trying to do, by:
- teaching you how to always remain on the bleeding edge of published AI research
- broadening your perspective on current open options for any given task
- freeing you from being tied to a closed-source / closed-data large language model (e.g. from OpenAI or Anthropic)
- creating a data-led system for always identifying and using the state-of-the-art (SOTA) model for any particular task.
We're going to home in on "text summarization" as our first task.
So... why are we not using one of the popular large language models?
Great question. Most available LLMs worth their salt can do many tasks, including summarization, but not all of them are good at the specific thing you want them to do. We should figure out how to evaluate whether they actually are.
Also, many of the current popular LLMs are not open, are trained on undisclosed data and exhibit biases. Responsible AI use requires careful choices, and we're here to help you make them.
Finally, most LLMs require powerful GPU compute to use. While there are many models that you can use as a service, most of them cost money per API call. That cost is unnecessary when some of the more common tasks can be done at good quality with already available open models and off-the-shelf hardware.
Why does using open models matter?
Over the last few decades, engineers have been blessed with being able to onboard by starting with open source projects, and eventually shipping open source to production. This default state is now at risk.
Yes, there are many open models available that do a great job. However, most guides don't discuss how to get started with them using simple steps and instead bias towards existing closed APIs.
Funding is flowing to commercial AI projects, which have larger budgets than open source contributors to market their work. This inevitably leads to engineers starting with closed source projects and shipping expensive closed projects to production.
Our First Project - Summarization
We're going to:
- Find text to summarize.
- Figure out how to summarize it using the current state-of-the-art open source models.
- Write some code to do so.
- Evaluate the quality of the results using relevant metrics.
For simplicity's sake, let's grab Mozilla's Trustworthy AI Guidelines in string form.
Note that in the real world, you will likely have to use other libraries to extract content from any particular file type.
In [ ]:
import textwrap
content = """Mozilla's "Trustworthy AI" Thinking Points:
PRIVACY: How is data collected, stored, and shared? Our personal data powers everything from traffic maps to targeted advertising. Trustworthy AI should enable people to decide how their data is used and what decisions are made with it.
FAIRNESS: We’ve seen time and again how bias shows up in computational models, data, and frameworks behind automated decision making. The values and goals of a system should be power aware and seek to minimize harm. Further, AI systems that depend on human workers should protect people from exploitation and overwork.
TRUST: People should have agency and control over their data and algorithmic outputs, especially considering the high stakes for individuals and societies. For instance, when online recommendation systems push people towards extreme, misleading content, potentially misinforming or radicalizing them.
SAFETY: AI systems can carry high risk for exploitation by bad actors. Developers need to implement strong measures to protect our data and personal security. Further, excessive energy consumption and extraction of natural resources for computing and machine learning accelerates the climate crisis.
TRANSPARENCY: Automated decisions can have huge personal impacts, yet the reasons for decisions are often opaque. We need to mandate transparency so that we can fully understand these systems and their potential for harm."""
Great. Now we're ready to start summarizing.
A brief pause for context.
The AI space is moving so fast that it requires a tremendous amount of catching up on scientific papers each week to understand the lay of the land and the state of the art.
It takes some effort for an engineer who is brand new to AI to discover:
- which open models are even out there
- which models are appropriate for any particular task
- which benchmarks are used to evaluate those models
- which models are performing well based on evaluations
- which models can actually run on available hardware
For the working engineer on a deadline, this is problematic. There's not much centralized discourse on working with open source AI models. Instead there are fragmented X (formerly Twitter) threads, random private groups and lots of word-of-mouth transfer.
However, once we have a workflow to address all of the above, you will have the means to stay on the bleeding edge of published AI research.
How do I get a list of available open summarization models?
For now, we recommend Huggingface and their large directory of open models broken down by task. This is a great starting point. Note that larger LLMs are also included in these lists, so we will have to filter.
In this huge list of summarization models, which ones do we choose?
We often don't know what these models were trained on, and that matters. For example, a summarizer trained on news articles will likely perform better on news articles than one trained on Reddit posts.
What we need is a set of metrics and benchmarks that we can use to do apples-to-apples comparisons of these models.
How do I evaluate summarization models?
The steps below can be used to evaluate any available model for any task. They require hopping between a few sources of data for now, but we will be making this a lot easier moving forward.
Steps:
- Find the most common datasets used to train models for summarization.
- Find the most common metrics used to evaluate models for summarization across those datasets.
- Do a quick audit on training data provenance, quality and any exhibited biases, to keep in line with Responsible AI usage.
Finding datasets
The easiest way to do this is using Papers With Code, an excellent resource for finding the latest scientific papers by task that also have code repositories attached.
First, filter Papers With Code's "Text Summarization" datasets by most cited text-based English datasets.
Let's pick (as of this writing) the most cited dataset -- the "CNN/DailyMail" dataset. Usually most cited is one marker of popularity.
Now, you don't need to download this dataset. But we're going to review the info Papers With Code have provided to learn more about it for the next step. This dataset is also available on Huggingface.
You want to check the following:
- license
- recent papers
- whether the data is traceable and the methods are transparent
First, check the license. In this case, it's MIT licensed, which means it can be used for both commercial and personal projects.
Next, see if the papers using this dataset are recent. You can do this by sorting Papers in descending order. This particular dataset has many papers from 2023 - great!
Finally, let's check whether the data is from a credible source. In this case, the dataset was generated by IBM in partnership with the University of Montréal. Great.
Now, let's dig into how we can evaluate models that use this dataset.
Evaluating models
Next, we look for measured metrics that are common across datasets for the summarization task. BUT, if you're not familiar with the literature on summarization, you have no idea what those are.
To find out, pick a "Subtask" that's close to what you'd like to see. We'd like to summarize the text we pulled down above, so let's choose "Abstractive Text Summarization".
Now we're in business! This page contains a significant amount of new information.
There are mentions of three new terms: ROUGE-1, ROUGE-2 and ROUGE-L. These are the metrics that are used to measure summarization performance.
There is also a list of models and their scores on these three metrics - this is exactly what we're looking for.
Assuming we're looking at ROUGE-1 as our metric, we now have the top 3 models that we can evaluate in more detail. All 3 are close to 50, which is a promising ROUGE score (read up on ROUGE).
Testing out a model
OK, we have a few candidates, so let's pick a model that will run on our local machines. Many models get their best performance when running on GPUs, but there are many that also generate summaries fast on CPUs. Let's pick one of those to start - Google's Pegasus.
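To get a rough feel for whether a model will fit on your hardware, you can estimate the memory its weights need from the parameter count. Here's a minimal sketch. The parameter counts below are approximate figures assumed for illustration (check the model card for real numbers), the 7B entry is a made-up model for contrast, `weights_memory_gb` is a hypothetical helper, and the estimate ignores activations and other runtime overhead.

```python
def weights_memory_gb(n_params, bytes_per_param=4):
    """Rough memory needed for the weights alone (4 bytes per parameter for float32)."""
    return n_params * bytes_per_param / 1024**3

# Approximate parameter counts, assumed for illustration only
approx_params = {
    "google/pegasus-cnn_dailymail": 568_000_000,
    "hypothetical-7b-model": 7_000_000_000,
}

T4_MEMORY_GB = 16  # nominal memory of an NVIDIA T4

for name, n_params in approx_params.items():
    needed = weights_memory_gb(n_params)
    verdict = "should fit" if needed < T4_MEMORY_GB else "will not fit in float32"
    print(f"{name}: ~{needed:.1f} GB of weights, {verdict} on a T4")
```

Treat the result as a lower bound: generation needs extra memory on top of the weights.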
In [ ]:
# first we install huggingface's transformers library
%pip install transformers sentencepiece
Then we find Pegasus on Huggingface. Note that the datasets Pegasus was trained on include CNN/DailyMail, which bodes well for our article summarization. Interestingly, there's a variant of Pegasus from Google that's trained only on our dataset of choice, so we should use that.
In [ ]:
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
import torch
# Set the seed, this will help reproduce results. Changing the seed will
# generate new results
from transformers import set_seed
set_seed(248602)
# We're using the version of Pegasus specifically trained for summarization
# using the CNN/DailyMail dataset
model_name = "google/pegasus-cnn_dailymail"
# If you're following along in Colab, switch your runtime to a
# T4 GPU or other CUDA-compliant device for a speedup
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load the tokenizer
tokenizer = PegasusTokenizer.from_pretrained(model_name)
# Load the model
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(device)
In [ ]:
# Tokenize the entire content
batch = tokenizer(content, padding="longest", return_tensors="pt").to(device)
# Generate the summary as tokens
summarized = model.generate(**batch)
# Decode the tokens back into text
summarized_decoded = tokenizer.batch_decode(summarized, skip_special_tokens=True)
summarized_text = summarized_decoded[0]
# Compare
def compare(original, summarized_text):
    print(f"Article text length: {len(original)}\n")
    print(textwrap.fill(summarized_text, 100))
    print()
    print(f"Summarized length: {len(summarized_text)}")
compare(content, summarized_text)
Article text length: 1427
Trustworthy AI should enable people to decide how their data is used.<n>values and goals of a system
should be power aware and seek to minimize harm.<n>People should have agency and control over their
data and algorithmic outputs.<n>Developers need to implement strong measures to protect our data and
personal security.
Summarized length: 320
Alright, we got something! Kind of short though. Let's see if we can make the summary longer...
In [ ]:
set_seed(860912)
# Generate the summary as tokens, with a max_new_tokens
summarized = model.generate(**batch, max_new_tokens=800)
summarized_decoded = tokenizer.batch_decode(summarized, skip_special_tokens=True)
summarized_text = summarized_decoded[0]
compare(content, summarized_text)
Article text length: 1427
Trustworthy AI should enable people to decide how their data is used.<n>values and goals of a system
should be power aware and seek to minimize harm.<n>People should have agency and control over their
data and algorithmic outputs.<n>Developers need to implement strong measures to protect our data and
personal security.
Summarized length: 320
Well, that didn't really work. Let's try a different approach called 'sampling'. With sampling, the model picks the next word at random according to its conditional probability distribution (that is, the probability of each candidate word given the words that came before).
We'll also be setting the 'temperature'. This variable controls the level of randomness and creativity in the generated output: lower values make the output more predictable, higher values make it more varied.
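To see what temperature does to a probability distribution, here's a small standalone sketch. The three candidate words and their scores are made up for illustration (they don't come from Pegasus), and `softmax_with_temperature` is a hypothetical helper, not part of the transformers API.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up next-word candidates and scores
vocab = ["data", "privacy", "banana"]
logits = [2.0, 1.0, -1.0]

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")

# Sampling: draw the next word at random, weighted by its probability
rng = random.Random(0)
probs = softmax_with_temperature(logits, 0.8)
next_word = rng.choices(vocab, weights=probs, k=1)[0]
print(next_word)
```

At low temperature nearly all of the probability lands on the top-scoring word; at high temperature the unlikely words get a real chance of being drawn.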
In [ ]:
set_seed(118511)
summarized = model.generate(**batch, do_sample=True, temperature=0.8, top_k=0)
summarized_decoded = tokenizer.batch_decode(summarized, skip_special_tokens=True)
summarized_text = summarized_decoded[0]
compare(content, summarized_text)
Article text length: 1427
Mozilla's "Trustworthy AI" Thinking Points:.<n>People should have agency and control over their data
and algorithmic outputs.<n>Developers need to implement strong measures to protect our data.
Summarized length: 193
Shorter, but the quality is higher. Adjusting the temperature up will likely help.
In [ ]:
set_seed(108814)
summarized = model.generate(**batch, do_sample=True, temperature=1.0, top_k=0)
summarized_decoded = tokenizer.batch_decode(summarized, skip_special_tokens=True)
summarized_text = summarized_decoded[0]
compare(content, summarized_text)
Article text length: 1427
Mozilla's "Trustworthy AI" Thinking Points:.<n>People should have agency and control over their data
and algorithmic outputs.<n>Developers need to implement strong measures to protect our data and
personal security.<n>We need to mandate transparency so that we can fully understand these systems
and their potential for harm.
Summarized length: 325
Now let's play with one other generation approach called top_k sampling -- instead of considering all possible next words in the vocabulary, the model only considers the top 'k' most probable next words.
This technique helps to focus the model on likely continuations and reduces the chances of generating irrelevant or nonsensical text.
It strikes a balance between creativity and coherence by limiting the pool of next word choices, but not so much that the output becomes deterministic.
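As a toy illustration of the idea, here's a sketch of top-k filtering over a made-up next-word distribution. `top_k_filter` is a hypothetical helper written for this example; the transformers library applies the same idea internally to the model's scores.

```python
def top_k_filter(probs, k):
    """Keep the k most probable words and renormalize their probabilities."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {word: p / total for word, p in top}

# Made-up probabilities for the next word
next_word_probs = {"data": 0.5, "privacy": 0.3, "security": 0.15, "banana": 0.05}

filtered = top_k_filter(next_word_probs, k=2)
print(filtered)
```

With k=2, only "data" and "privacy" survive, and their probabilities are rescaled to add up to 1 before sampling.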
In [ ]:
set_seed(226012)
summarized = model.generate(**batch, do_sample=True, top_k=50)
summarized_decoded = tokenizer.batch_decode(summarized, skip_special_tokens=True)
summarized_text = summarized_decoded[0]
compare(content, summarized_text)
Article text length: 1427
Mozilla's "Trustworthy AI" Thinking Points look at ethical issues surrounding automated decision
making.<n>values and goals of a system should be power aware and seek to minimize harm.<n>People
should have agency and control over their data and algorithmic outputs.<n>Developers need to
implement strong measures to protect our data and personal security.
Summarized length: 355
Finally, let's try top_p sampling, also known as nucleus sampling. This is a strategy where the model considers only the smallest set of top words whose cumulative probability exceeds a threshold 'p'.
Unlike top-k, which considers a fixed number of words, top-p adapts based on the distribution of probabilities for the next word. This makes it more dynamic and flexible. It helps create diverse and sensible text by allowing less probable words to be selected when the most probable ones don't add up to 'p'.
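Here's a toy sketch of how the nucleus adapts to the shape of the distribution. The two distributions are made up, and `top_p_filter` is a hypothetical helper; it is a simplified version of the idea, not the library's implementation.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most-probable words whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for word, prob in ranked:
        kept.append((word, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {word: prob / total for word, prob in kept}

# Made-up distributions: one peaked, one spread out
peaked = {"data": 0.92, "privacy": 0.05, "security": 0.02, "banana": 0.01}
flat = {"data": 0.4, "privacy": 0.3, "security": 0.25, "banana": 0.05}

print(top_p_filter(peaked, 0.9))
print(top_p_filter(flat, 0.9))
```

With the peaked distribution a single word already covers 90% of the probability, so the nucleus has one word. With the flatter one, three words are needed.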
In [ ]:
set_seed(21420041)
summarized = model.generate(**batch, do_sample=True, top_p=0.9, top_k=50)
summarized_decoded = tokenizer.batch_decode(summarized, skip_special_tokens=True)
summarized_text = summarized_decoded[0]
compare(content, summarized_text)
# saving this for later.
pegasus_summarized_text = summarized_text
Article text length: 1427
Mozilla's "Trustworthy AI" Thinking Points:.<n>People should have agency and control over their data
and algorithmic outputs.<n>Developers need to implement strong measures to protect our data and
personal security.<n>We need to mandate transparency so that we can fully understand these systems
and their potential for harm.
Summarized length: 325
Now, let's try out another model -- Meta's "BART".
Looking at the PapersWithCode graph, BART has solid results with ROUGE-1.
Similar to Pegasus, BART has a custom version fine-tuned on CNN/DailyMail data.
In [ ]:
from transformers import BartTokenizer, BartForConditionalGeneration
set_seed(120986)
bart_model_name = "facebook/bart-large-cnn"
# Load the tokenizer
bart_tokenizer = BartTokenizer.from_pretrained(bart_model_name)
# Load the model
bart_model = BartForConditionalGeneration.from_pretrained(bart_model_name).to(device)
In [ ]:
# Using similar sampling settings to Pegasus (here with top_p=0.5 and max_new_tokens=500), let's try running BART
batch = bart_tokenizer(content, padding="longest", return_tensors="pt").to(device)
summarized = bart_model.generate(**batch, do_sample=True, top_p=0.5, top_k=50, max_new_tokens=500)
summarized_decoded = bart_tokenizer.batch_decode(summarized, skip_special_tokens=True)
summarized_text = summarized_decoded[0]
compare(content, summarized_text)
bart_summarized_text = summarized_text
Article text length: 1427
Mozilla's "Trustworthy AI" Thinking Points: How is data collected, stored, and shared? Our personal
data powers everything from traffic maps to targeted advertising. Trustworthy AI should enable
people to decide how their data is used and what decisions are made with it.
Summarized length: 271
Is this the best that BART can do? Unlikely. We can take this as a starting point to experiment.
You should now have enough of a workflow mapped out to find, select and try these models out for not just summarization but any text-based use-case. Let's start learning and experimenting!
There are many other variables that control the output, but this is a great stopping point to switch over to how to evaluate the results of these and any other model quantitatively for your own use-cases.
Evaluating ML Model Results
How well is this model performing anyway?
In the last section, we learned a lot about how to find the right metrics to evaluate a model's performance, and compare them apples to apples using published data.
The next step is to apply those metrics to your own use-case, driving a more quantitative approach to ML model usage.
First, let's import the same text back in as our input content and re-generate those summaries from the last section.
In [ ]:
import textwrap
content = """Mozilla's "Trustworthy AI" Thinking Points:
PRIVACY: How is data collected, stored, and shared? Our personal data powers everything from traffic maps to targeted advertising. Trustworthy AI should enable people to decide how their data is used and what decisions are made with it.
FAIRNESS: We’ve seen time and again how bias shows up in computational models, data, and frameworks behind automated decision making. The values and goals of a system should be power aware and seek to minimize harm. Further, AI systems that depend on human workers should protect people from exploitation and overwork.
TRUST: People should have agency and control over their data and algorithmic outputs, especially considering the high stakes for individuals and societies. For instance, when online recommendation systems push people towards extreme, misleading content, potentially misinforming or radicalizing them.
SAFETY: AI systems can carry high risk for exploitation by bad actors. Developers need to implement strong measures to protect our data and personal security. Further, excessive energy consumption and extraction of natural resources for computing and machine learning accelerates the climate crisis.
TRANSPARENCY: Automated decisions can have huge personal impacts, yet the reasons for decisions are often opaque. We need to mandate transparency so that we can fully understand these systems and their potential for harm."""
%pip install transformers sentencepiece
from transformers import set_seed
set_seed(248602)
In [ ]:
# Loading up Pegasus and BART again
import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
from transformers import BartTokenizer, BartForConditionalGeneration
device = "cuda" if torch.cuda.is_available() else "cpu"
In [ ]:
# summarizing using BART
set_seed(120986)
bart_model_name = "facebook/bart-large-cnn"
# Load the tokenizer
bart_tokenizer = BartTokenizer.from_pretrained(bart_model_name)
# Load the model
bart_model = BartForConditionalGeneration.from_pretrained(bart_model_name).to(device)
In [ ]:
# fetch BART summary
batch = bart_tokenizer(content, padding="longest", return_tensors="pt").to(device)
summarized = bart_model.generate(**batch, do_sample=True, top_p=0.5, top_k=50, max_new_tokens=500)
summarized_decoded = bart_tokenizer.batch_decode(summarized, skip_special_tokens=True)
bart_summarized_text = summarized_decoded[0]
print(bart_summarized_text)
Mozilla's "Trustworthy AI" Thinking Points: How is data collected, stored, and shared? Trustworthy AI should enable people to decide how their data is used. AI systems that depend on human workers should protect people from exploitation and overwork. The values and goals of a system should be power aware and seek to minimize harm.
In [ ]:
# summarizing using Pegasus
# We're using the version of Pegasus specifically trained for summarization
# using the CNN/DailyMail dataset
model_name = "google/pegasus-cnn_dailymail"
# Load the tokenizer
tokenizer = PegasusTokenizer.from_pretrained(model_name)
# Load the model
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(device)
In [ ]:
# Tokenize the entire content
batch = tokenizer(content, padding="longest", return_tensors="pt").to(device)
# Generate the summary as tokens
summarized = model.generate(**batch)
# Decode the tokens back into text
summarized_decoded = tokenizer.batch_decode(summarized, skip_special_tokens=True)
pegasus_summarized_text = summarized_decoded[0]
print(pegasus_summarized_text)
Trustworthy AI should enable people to decide how their data is used.<n>values and goals of a system should be power aware and seek to minimize harm.<n>People should have agency and control over their data and algorithmic outputs.<n>Developers need to implement strong measures to protect our data and personal security.
Great. Now, how do you compare both of these models apples-to-apples in terms of summarization output? We're going to look at the familiar ROUGE-1 metric in more detail.
Using a metric to evaluate model results
Typically, models are trained on larger datasets. Some fraction of the dataset is hidden from the model, and the evaluation is done by comparing the expected output from the dataset with the actual output from the model.
However, this doesn't work that well in our example. We usually will have a new bit of text that we're looking to get the best possible summary for.
So, we will need to build a small dataset to compare our task-specific performance. In this case, let's use ROUGE-1 to score how closely our generated summaries match one a human would write.
In [ ]:
# Let's create a human-powered reference:
reference = """
Mozilla's Trustworthy AI principles are Privacy controls over personal data,
minimizing bias and exploitation and maximizing Fairness,
ensuring data is sourced. and used appropriately leading to Trust,
Safety systems to protect from bad actors and environmental harm, and
Transparency to understand these systems in order to to reduce harm"""
Conveniently, there is a Python ROUGE library that makes our computation easy.
In [ ]:
%pip install rouge
Next, we're going to dig into some specifics around how models are evaluated -- precision, recall and F-score. Let's start by comparing the two summarized texts with this reference by computing the ROUGE-1 score, then we'll dig into the pieces.
In [ ]:
from rouge import Rouge
rouge = Rouge()
# Now let's get the ROUGE scores
pegasus_scores = rouge.get_scores(pegasus_summarized_text, reference)[0]
bart_scores = rouge.get_scores(bart_summarized_text, reference)[0]
OK, now we have some actual numbers to look at.
First off, ROUGE is a metric that measures n-gram overlap. In 'n-gram' models and metrics, 'n' refers to the number of words. So:
- ROUGE-1: n=1, unigram (single word)
- ROUGE-2: n=2, bigram (word pairs)
- ROUGE-L: measures overlap based on longest common subsequence (LCS) between generated and reference summaries
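To make the ROUGE-L idea concrete, here's a minimal sketch of a longest common subsequence computation over words. The `lcs_length` helper and the two toy sentences are ours, for illustration only; a ROUGE library has its own implementation and may tokenize differently.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

generated_tokens = "people should control their data".split()
reference_tokens = "people should have control over their data".split()

print(lcs_length(generated_tokens, reference_tokens))  # prints 5
```

The words only need to appear in the same order, not next to each other, which is why all five generated words count here even though the reference has extra words in between.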
Each ROUGE score has three more components:
- Precision (p): fraction of n-grams in the generated summary that also appear in the reference summary
- Recall (r): fraction of n-grams in the reference summary that also appear in the generated summary
- F-Score (f): harmonic mean of precision & recall, the final score
If a model has a ROUGE-1 score of 44, this means the F-score is 0.44. Roughly speaking, about 44% of the unigrams (single words) overlap between the summaries the model generated and the reference summaries in the evaluation dataset, once precision and recall are balanced against each other.
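Here's a rough sketch of how precision, recall and F-score fall out of n-gram counts. It assumes plain whitespace tokenization and clipped n-gram counts; `rouge_n` and the toy sentences are hypothetical, and a real ROUGE library may tokenize and count differently, so don't expect identical numbers from it.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams (as tuples of words) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(generated, reference, n=1):
    """Simplified ROUGE-N: overlap of n-grams between generated and reference text."""
    gen = ngrams(generated.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((gen & ref).values())  # clipped count of shared n-grams
    p = overlap / max(sum(gen.values()), 1)
    r = overlap / max(sum(ref.values()), 1)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"p": p, "r": r, "f": f}

toy_generated = "the cat sat on the mat"
toy_reference = "the cat lay on the mat"

print(rouge_n(toy_generated, toy_reference, n=1))
print(rouge_n(toy_generated, toy_reference, n=2))
```

For unigrams, 5 of the 6 words match in each direction, so precision, recall and F-score all come out to 5/6. For bigrams only 3 of the 5 word pairs match, which is why ROUGE-2 scores tend to be lower than ROUGE-1.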
In [ ]:
# ROUGE-1 Scores
print(f"Pegasus ROUGE-1 Scores: {pegasus_scores['rouge-1']}")
print(f"BART ROUGE-1 Scores: {bart_scores['rouge-1']}")
print()
Pegasus ROUGE-1 Scores: {'r': 0.275, 'p': 0.275, 'f': 0.2749999950000001}
BART ROUGE-1 Scores: {'r': 0.325, 'p': 0.29545454545454547, 'f': 0.3095238045351475}
In [ ]:
# ROUGE-2 Scores
print(f"Pegasus ROUGE-2 Scores: {pegasus_scores['rouge-2']}")
print(f"BART ROUGE-2 Scores: {bart_scores['rouge-2']}")
print()
Pegasus ROUGE-2 Scores: {'r': 0.0625, 'p': 0.061224489795918366, 'f': 0.06185566510362459}
BART ROUGE-2 Scores: {'r': 0.0625, 'p': 0.05660377358490566, 'f': 0.05940593560631353}
In [ ]:
# ROUGE-L Scores
print(f"Pegasus ROUGE-L Scores: {pegasus_scores['rouge-l']}")
print(f"BART ROUGE-L Scores: {bart_scores['rouge-l']}")
print()
Pegasus ROUGE-L Scores: {'r': 0.2, 'p': 0.2, 'f': 0.19999999500000015}
BART ROUGE-L Scores: {'r': 0.3, 'p': 0.2727272727272727, 'f': 0.2857142807256236}
Now, even though both Pegasus & BART have reported ROUGE-1 scores of 40+, our scores here for the Mozilla Trustworthy AI human-generated summary are approximately:
- Pegasus: 27
- BART: 31
While this is below the published scores, note that we now have a reasonable sense of quality based on our expectations.
This means we can iteratively make adjustments to our model text generation to get us closer to the average scores and beyond. But hold on... how do we know that ROUGE can't be easily gamed? Well, unfortunately for us, it can.
Counter-Metrics & Why They Matter
ROUGE is an interesting metric in that it requires a human reference summary to compare against, and the final F-score is very closely tied to it.
Let's see what happens when we change the reference summary to something longer...
In [ ]:
reference = "Mozilla's 'Trustworthy AI' is built on five key principles. Privacy emphasizes user control over data collection and usage. Fairness focuses on minimizing bias in computational models, as well as protecting human workers from exploitation. Trust aims to provide individuals with control over their data and the decisions made by algorithms. Safety prioritizes protection against misuse of data, as well as reducing environmental impact. Lastly, Transparency mandates clarity in automated decision-making processes to prevent potential harm."
print(textwrap.fill(reference, 100))
Mozilla's 'Trustworthy AI' is built on five key principles. Privacy emphasizes user control over
data collection and usage. Fairness focuses on minimizing bias in computational models, as well as
protecting human workers from exploitation. Trust aims to provide individuals with control over
their data and the decisions made by algorithms. Safety prioritizes protection against misuse of
data, as well as reducing environmental impact. Lastly, Transparency mandates clarity in automated
decision-making processes to prevent potential harm.
Let's re-run the ROUGE scoring with this reference summary...
In [ ]:
# Now let's get the ROUGE scores
pegasus_scores = rouge.get_scores(pegasus_summarized_text, reference)[0]
bart_scores = rouge.get_scores(bart_summarized_text, reference)[0]
# ROUGE-1 Scores
print(f"Pegasus ROUGE-1 Scores: {pegasus_scores['rouge-1']}")
print(f"BART ROUGE-1 Scores: {bart_scores['rouge-1']}")
print()
# ROUGE-2 Scores
print(f"Pegasus ROUGE-2 Scores: {pegasus_scores['rouge-2']}")
print(f"BART ROUGE-2 Scores: {bart_scores['rouge-2']}")
print()
# ROUGE-L Scores
print(f"Pegasus ROUGE-L Scores: {pegasus_scores['rouge-l']}")
print(f"BART ROUGE-L Scores: {bart_scores['rouge-l']}")
print()
Pegasus ROUGE-1 Scores: {'r': 0.140625, 'p': 0.225, 'f': 0.1730769183431954}
BART ROUGE-1 Scores: {'r': 0.203125, 'p': 0.29545454545454547, 'f': 0.24074073591220863}
Pegasus ROUGE-2 Scores: {'r': 0.056338028169014086, 'p': 0.08163265306122448, 'f': 0.06666666183472257}
BART ROUGE-2 Scores: {'r': 0.04225352112676056, 'p': 0.05660377358490566, 'f': 0.04838709187955305}
Pegasus ROUGE-L Scores: {'r': 0.140625, 'p': 0.225, 'f': 0.1730769183431954}
BART ROUGE-L Scores: {'r': 0.203125, 'p': 0.29545454545454547, 'f': 0.24074073591220863}
OK... that's a sizeable drop in ROUGE-1 F-scores across both models, which is potentially problematic. The challenge here is that ROUGE doesn't capture the semantic meaning of the text being summarized; it uses n-gram overlap as a proxy.
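The F-score arithmetic shows why a longer reference pulls the score down. Here's a small sketch that plugs in the BART ROUGE-1 precision and recall values printed above. The `f_score` helper is ours; the library's own values differ slightly in the last few decimal places.

```python
def f_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

bart_precision = 0.29545454545454547  # same against both references

f_short_reference = f_score(bart_precision, r=0.325)
f_long_reference = f_score(bart_precision, r=0.203125)

print(f"short reference: {f_short_reference:.4f}")
print(f"long reference:  {f_long_reference:.4f}")
```

Precision is unchanged, but recall falls from 0.325 to about 0.203 because the longer reference contains more words for the summary to cover, and the harmonic mean drops with it.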
So, are there other metrics we can use to provide a more holistic picture? Yes, meet BERTScore.
Using BERTScore for evaluating summary quality
BERTScore is a metric that uses contextual embeddings (in code, represented as matrix math, more on this later) to compare semantic similarities between the reference and generated summaries.
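As a rough intuition for how an embedding-based score works, here's a toy sketch of greedy matching: each word in one text is paired with its most similar word in the other text by cosine similarity, and the similarities are averaged into precision, recall and F1. The 2-D vectors below are made up for illustration (real contextual embeddings come from a model and have many more dimensions), `greedy_match_scores` is a hypothetical helper, and this leaves out details of the actual BERTScore computation such as optional importance weighting and baseline rescaling.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def greedy_match_scores(candidate_vecs, reference_vecs):
    """Average best-match similarity in each direction, then combine into F1."""
    precision = sum(max(cosine(c, r) for r in reference_vecs)
                    for c in candidate_vecs) / len(candidate_vecs)
    recall = sum(max(cosine(r, c) for c in candidate_vecs)
                 for r in reference_vecs) / len(reference_vecs)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up 2-D "embeddings"; similar meanings point in similar directions
toy_embeddings = {
    "hot": [0.9, 0.1],
    "warm": [0.8, 0.2],
    "summer": [0.7, 0.6],
    "banana": [-0.2, 0.9],
}

def embed(words):
    return [toy_embeddings[w] for w in words]

reference_vecs = embed(["warm", "summer"])
close = greedy_match_scores(embed(["hot", "summer"]), reference_vecs)
far = greedy_match_scores(embed(["banana"]), reference_vecs)

print(f"on-topic candidate F1:  {close[2]:.3f}")
print(f"off-topic candidate F1: {far[2]:.3f}")
```

Note that "hot" and "warm" share no characters, so an exact-match metric gives them no credit, while their vectors are close enough to score highly here.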
According to the original paper, "BERTScore correlates better with human judgments and provides stronger model selection performance than existing metrics".
Unlike ROUGE, BERTScore generates results using another ML model from the BERT family, making it computationally more expensive. By default, the bert_score library we're going to use will use the roberta-large model downloaded from Huggingface.
Let's give this a shot.
In [ ]:
%pip install bert_score
In [ ]:
from bert_score import BERTScorer
# Let's setup BERTScorer and score the Pegasus set first
scorer = BERTScorer(lang="en", rescale_with_baseline=True)
In [ ]:
p, r, f1 = scorer.score([pegasus_summarized_text], [reference])
print(f"Pegasus BERTScore: 'r': {r}, 'p': {p}, 'f': {f1}")
Pegasus BERTScore: 'r': tensor([0.2065]), 'p': tensor([0.1569]), 'f': tensor([0.1829])
The BERTScore F1 value for Pegasus is about 0.18 (18 on a 0-100 scale).
In [ ]:
p, r, f1 = scorer.score([bart_summarized_text], [reference])
print(f"BART BERTScore: 'r': {r}, 'p': {p}, 'f': {f1}")
BART BERTScore: 'r': tensor([0.2861]), 'p': tensor([0.3487]), 'f': tensor([0.3183])
Comparing apples-to-apples, the embedding-powered BERTScore reports an F1 score of 32 for the BART summary versus 18 for Pegasus, which indicates that as far as this metric goes, BART produced the better summary.
So, should we use ROUGE or BERTScore? The answer is - it's not that simple. Depending on your requirements, you should evaluate these and other metrics to decide whether a model is performing well for your use case.
What we can say is -- BERTScore may do a better job capturing meaning, and here are some visuals to help explain.
The bert_score library we are using allows us to map out how each word in a generated summary is scored against a reference to get a deeper sense of semantic understanding.
In [ ]:
# Let's use two NEW summary sentences to illustrate this effectively.
scorer.plot_example("Hot days forecasted during peak summer, 54 degrees.",
"Summer days mean high temperatures of fifty-four C")
Note the correlations between:
- 'summer' and 'hot'
- 'temperatures' and 'degrees'
- 'fifty' and '54' and
- 'C' and 'degrees'
There's one other metric that we will not be covering but is also well-suited for gauging semantic meaning - QAEval, which uses a similar model-driven approach to scoring but also generates question/answer pairs to score against.
Using this foundation, in the next section focusing on Retrieval, we will be setting up a harness to store & retrieve documents, and monitor & evaluate the quality of our generative systems.