LLM Evaluation Metrics: ROUGE

Iman Johari
10 min read · Jun 22, 2024


Introduction

Considering that Large Language Models are being operationalized more than ever, we must ensure that any decision made based on AI-generated data rests on output that is correct and can be trusted.

Let’s say we run a medium-sized business that provides a service to our clients; we have an email address where we receive clients’ feedback, and a backlog of over 1,000 emails that we need to sift through to see if any of them relate to a client’s complaint. We can use generative AI to classify each email as either a complaint or a non-complaint; if an email is a complaint, we will take the required actions to improve our services.

The issue would arise if complaint emails were incorrectly classified as non-complaints: what could help us improve our services would stay hidden forever because the AI did not do its job right. To ensure that our chosen AI is doing a good job, certain measures have been defined that tell us whether our generative AI is performing well and can be trusted!

Depending on the use case there are various measures in place. Let’s introduce and dig a bit deeper into these measures. In this article we are going to base our findings on the evaluation metrics offered by Watsonx.governance.

Watsonx.governance is an AI governance platform offered by IBM that helps AI engineers and data scientists monitor traditional machine learning models as well as Large Language Models. We are going to focus only on LLM metrics in this article.

Here is a list of metrics supported by Watsonx.governance :

1- Rouge

2- SARI

3- Meteor

4- Text Quality

5- BLEU

6- Sentence Similarity

7- PII

8- HAP

9- Readability

10- Exact match

11- Multi-label/class metrics

There are two questions whose answers I want to explore:

1- Why do we need them?

2- How are they calculated?

Rouge

Rouge is an open-source evaluation metric available on Hugging Face; it stands for Recall-Oriented Understudy for Gisting Evaluation. It is not just one metric; it is a set of metrics {Rouge1, Rouge2, RougeL and RougeSum} for evaluating automatic summarization or translation. Its purpose is to measure how well the generated text compares to a reference text.

The idea is that we have an original text and we produce a summarization of it, manually or through some other method. This becomes our ground truth that the AI-generated content is compared with. We feed our original text to the generative AI and it will generate a summarization. Then we use a few computations to see how well the generated content compares to our reference summary.

For example, let’s consider the following original text: I played soccer in the park all afternoon.

I will summarize this myself to set the reference summary as : I played soccer.

Let’s say our Generative AI model processed this text and created a summary which is : Played soccer

Let’s explore each rouge metric and see what each of them means.

1- Rouge-1 calculates unigram overlaps; a unigram is a single-word token.

What is tokenization? Tokenization is the process of breaking down text into smaller units called tokens. There are different types of tokenization:

  • Word tokenization : Splitting a piece of text into individual words.
  • Sentence tokenization : Splitting text into individual sentences.
  • Character tokenization : Splitting text into individual characters.
  • Subword tokenization : Breaking words into smaller units, often useful for handling rare or unknown words.

For Rouge we’ll do word tokenization and create a unigram list for each summary; then we take each word and put it in a list (a minimal sketch of this follows the counts below).

Unigram for reference summary: {I, played, soccer}

  • Count of unigram for reference summary : 3

Unigram for generated summary: {played, soccer}

  • Count of unigrams for generated summary : 2

What are the overlapping unigrams, i.e. the unigrams that exist in both lists? {played, soccer}

  • Count of overlapping unigrams : 2
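
To make this concrete, here is a minimal sketch (not from the original article code) that tokenizes both summaries into unigrams and counts the overlap. It simply lowercases the text and splits on whitespace, which is a simplification of what the Rouge library actually does (the library also applies stemming and normalization):

# A minimal sketch of unigram tokenization and overlap counting
# (simple whitespace tokenization; the real Rouge implementation also stems and normalizes)
from collections import Counter

reference = "I played soccer"
generated = "played soccer"

ref_unigrams = Counter(reference.lower().split())  # {'i': 1, 'played': 1, 'soccer': 1}
gen_unigrams = Counter(generated.lower().split())  # {'played': 1, 'soccer': 1}

# A unigram counts as overlapping at most as many times as it appears in both summaries
overlap = sum((ref_unigrams & gen_unigrams).values())
print(overlap)  # 2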

For Rouge-1 there are 3 components that we can calculate based on the above counts:

  • Precision: This calculates the count of overlapping unigrams with respect to the count of unigrams in the generated summary. This yields 2/2 = 1; we basically want to see what percentage of the unigrams in the generated summary also appear in the reference summary. This is the percentage of relevant instances among the retrieved instances.
  • Recall : This calculates the count of overlapping unigrams with respect to the count of unigrams in the reference summary. This yields 2/3 = 0.67; we basically want to see what percentage of the unigrams in the reference summary were recovered in the generated summary. This is the percentage of relevant instances that were retrieved.
  • F1 Score : This is a balanced measure of how well the generated summary matches the reference summary, calculated as F1 = 2 × (Precision × Recall) / (Precision + Recall).

For the above summarization we have these values:

Precision : 1

Recall : 0.67

Plugging these numbers into the formula gives us 0.8.

Therefore for Rouge-1 we have the following numbers:

Precision: 1, Recall: 0.67 and F1-score: 0.8
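
Continuing the unigram sketch above, the three Rouge-1 components can be computed from those counts in a few lines:

# Rouge-1 precision, recall and F1 for the example, using the counts from the sketch above
overlap, gen_count, ref_count = 2, 2, 3

precision = overlap / gen_count                      # 2 / 2 = 1.0
recall = overlap / ref_count                         # 2 / 3 = 0.67
f1 = 2 * precision * recall / (precision + recall)   # 0.8
print(round(precision, 2), round(recall, 2), round(f1, 2))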

2- Rouge-2 calculates bigram overlaps; a bigram is a sequence of two adjacent elements from a string of tokens, which are typically words.

Let’s go back to our reference summary and generated summary :

Reference summary: I played soccer

Generated summary : played soccer

Bigrams of reference summary will be : { I played, played soccer }

  • Count of Bigrams of reference summary : 2

Bigrams of generated text will be : {played soccer}

  • Count of Bigrams of generated summary : 1

Bigrams of overlapping bigrams : { played soccer}

  • Count of overlapping bigrams : 1

Now let’s calculate precision, recall and F1-score for bigrams based on the formulas given for Rouge-1 (a short sketch follows the results below):

Precision : 1, Recall : 0.5, F1-score : 0.67
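
Here is the corresponding minimal sketch for Rouge-2, again using simple whitespace tokenization and pairing adjacent tokens into bigrams:

# A minimal sketch of Rouge-2: build bigrams, count the overlap, then score
from collections import Counter

def bigrams(text):
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))  # adjacent token pairs

ref_bigrams = bigrams("I played soccer")  # {('i', 'played'): 1, ('played', 'soccer'): 1}
gen_bigrams = bigrams("played soccer")    # {('played', 'soccer'): 1}

overlap = sum((ref_bigrams & gen_bigrams).values())  # 1
precision = overlap / sum(gen_bigrams.values())      # 1 / 1 = 1.0
recall = overlap / sum(ref_bigrams.values())         # 1 / 2 = 0.5
f1 = 2 * precision * recall / (precision + recall)   # 0.67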

3- Rouge-L calculates the longest common subsequence; The longest common subsequence between reference summary (I played soccer) and generated summary (played soccer) is : played soccer

For Rouge-L we need to have :

Length of the Longest common subsequence (LCS) : 2

Total words in reference summary : 3

Total words in generated summary : 2

Precision is calculated as the LCS length divided by the total words in the generated summary, which yields 2 / 2 = 1

Recall is calculated as the LCS length divided by the total words in the reference summary, which yields 2 / 3 = 0.67

F1-score is calculated as 2 × (Precision × Recall) / (Precision + Recall), which yields 0.80

Precision : 1, Recall : 0.67, F1-score : 0.80
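
The only new ingredient for Rouge-L is the longest common subsequence. Here is a minimal sketch, computing the LCS length over word tokens with standard dynamic programming (the Rouge library does this with additional normalization):

# A minimal sketch of Rouge-L: LCS length via dynamic programming over word tokens
def lcs_length(ref_tokens, gen_tokens):
    # dp[i][j] holds the LCS length of ref_tokens[:i] and gen_tokens[:j]
    dp = [[0] * (len(gen_tokens) + 1) for _ in range(len(ref_tokens) + 1)]
    for i, r in enumerate(ref_tokens, 1):
        for j, g in enumerate(gen_tokens, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if r == g else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

ref = "I played soccer".lower().split()
gen = "played soccer".lower().split()

lcs = lcs_length(ref, gen)                           # 2
precision = lcs / len(gen)                           # 2 / 2 = 1.0
recall = lcs / len(ref)                              # 2 / 3 = 0.67
f1 = 2 * precision * recall / (precision + recall)   # 0.80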

4- Rouge-LSum is an extension of Rouge-L and is more suitable for multi-sentence or paragraph-level summaries, whereas Rouge-L is more suitable for shorter text snippets, sentences or single-sentence summaries. In Rouge-LSum the LCS is calculated across multiple sentences, and precision, recall and F1-score are calculated based on that.

Reference Summary : Today was a hot day. I went swimming and ate.

Generated Summary : in a hot day I went swimming and ate.

  • Count of reference summary tokens : 10
  • Count of generated summary tokens: 9
  • Longest common subsequence : a hot day I went swimming and ate
  • count of LCS tokens : 8

Precision = 8 / 9 = 0.89

recall = 8 / 10 = 0.8

F1-score = 0.84

Precision : 0.89, Recall : 0.8, F1-score : 0.84
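
You can cross-check these hand calculations with the rouge-score library used in the next section. Note that for rougeLsum the library treats newline characters as sentence boundaries, so multi-sentence references are usually passed with a newline between sentences; the exact scores may differ slightly from the hand calculation because the library also applies stemming and normalization:

# Cross-checking the Rouge-LSum example with the rouge-score library
# (sentences separated by "\n" for rougeLsum; results may differ slightly from the hand calculation)
from rouge_score import rouge_scorer

reference = "Today was a hot day.\nI went swimming and ate."
generated = "in a hot day I went swimming and ate."

scorer = rouge_scorer.RougeScorer(['rougeLsum'], use_stemmer=True)
print(scorer.score(reference, generated)['rougeLsum'])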

Now that we understand the theory, let’s implement them.

Implementation

The first step is to install Rouge on your development machine (it is published on PyPI as rouge-score). Please follow this link to install Rouge from Google: https://github.com/google-research/google-research/tree/master/rouge

Once installed we can calculate rouge metrics with a few lines of code.

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
scores = scorer.score('The quick brown fox jumps over the lazy dog',
'The quick brown dog jumps on the log.')
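
The score method returns a dictionary keyed by metric name ('rouge1', 'rougeL', ...), where each value is a Score named tuple with precision, recall and fmeasure fields. Continuing directly from the snippet above, the individual components can be printed like this:

# Each entry in scores is a Score named tuple with precision, recall and fmeasure
for metric, score in scores.items():
    print(metric, score.precision, score.recall, score.fmeasure)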

Let’s go a step further. Let’s create a dashboard within which we can type our reference summary and generated summary, and call a REST service to return the Rouge metric results to the user interface. This is implemented in the GitHub repository here: https://github.com/ijgitsh/rougeExperimental

This code snippet, which is less than 50 lines of code, will calculate all the Rouge metrics:

from collections import namedtuple

from flask import Flask, jsonify, request
from flask_cors import CORS
from rouge_score import rouge_scorer

app = Flask(__name__)
CORS(app)

# Named tuple mirroring the Score structure returned by rouge_scorer
Score = namedtuple('Score', ['precision', 'recall', 'fmeasure'])

def score_to_dict(score):
    # Convert a Score named tuple into a JSON-serializable dictionary
    return {
        'precision': score.precision,
        'recall': score.recall,
        'fmeasure': score.fmeasure
    }

@app.route('/rouge', methods=['POST'])
def get_prompts():
    data = request.get_json()
    print(f"Received data: {data}")  # Log received data
    reference = data.get('reference', '')
    prediction = data.get('prediction', '')

    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL', 'rougeLsum'], use_stemmer=True)
    scores = scorer.score(reference, prediction)
    print(f"Calculated scores: {scores}")  # Log calculated scores

    rouge_scores_dict = {key: score_to_dict(value) for key, value in scores.items()}
    print(f"ROUGE scores dict: {rouge_scores_dict}")  # Log the scores dictionary
    return jsonify(rouge_scores_dict)

if __name__ == '__main__':
    app.run(debug=True)
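
Once the Flask service is running locally, the endpoint can be exercised with a small client. Here is a hypothetical usage sketch, assuming the app is served on Flask’s default host and port (localhost:5000):

# A hypothetical client for the /rouge endpoint (assumes the app runs on localhost:5000)
import requests

payload = {
    "reference": "I played soccer",
    "prediction": "played soccer"
}
response = requests.post("http://localhost:5000/rouge", json=payload)
print(response.json())  # precision, recall and fmeasure for each Rouge variant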

And a simple user interface can show all the metrics from the REST service; you can set up the “ROUGE Score Dashboard” app from this GitHub repository: https://github.com/ijgitsh/rougeExperimental

Watsonx.Governance

Now that we know how to calculate this metric and showcase it for a couple of records on a simple dashboard, let’s see how we can utilize this metric in the context of AI governance at a larger scale. One of the main pillars of AI governance is model evaluation. The evaluation is measured in the context of an AI use case. An AI use case is a use for one of the generative AI outcomes, whether it is summarization, entity extraction, question answering, etc.

In Watsonx.governance we can create these use cases for internal models and external models; we will create an AI use case for an internal model. These use cases are stored in a model use case inventory and can later be monitored by the AI governance team to check whether they are associated with any risks or whether they comply with our rules and regulations.

Let’s take a step back; what is summarization in generative AI? A summarization is an instruction telling the LLM to generate the data in a desired fashion: shorter, but including certain details. That instruction, as you know, is called a prompt. A prompt is the central point in generative AI, and metrics are evaluated around a prompt.

We create and store prompts in our project workspace and track them in an AI use case.

Prompts in Watsonx.governance from creation to consumption

Let’s examine what a prompt contains and is associated with in Watsonx.governance:

1- A prompt is associated with a foundation model supported on the Watsonx.ai platform, and each LLM has various parameters

2- A prompt is associated with an AI use case; the AI use case shows the lifecycle of the prompt from initiation to deployment in production; in addition, it also contains the versioning of the prompt through its lifecycle. An AI use case can also have various approaches to help solve the same problem with generative AI, e.g. using a different LLM with different parameters for the same prompt and comparing the two approaches

3- A prompt has various metadata that are captured in the AI use case, such as purpose, description, any supporting documents, etc.

4- A prompt has access control, e.g. controlling who can modify prompts and impact the results of a prompt

5- A prompt can be deployed as a service in a deployment space to serve any application that uses LLMs

6- A prompt is associated with various evaluation metrics that show how well it performs for the AI use case it was made for

In Watsonx.governance we can view all the AI use cases for which we have been given view permission. These AI use cases track foundation models hosted on Watsonx.ai as well as models hosted external to the platform, e.g. ChatGPT models or models on Google Vertex AI.

We can capture a lot of built-in and custom metadata with respect to our AI use case

We can govern our AI use case through the lifecycle of the prompt, from development and validation to operationalization. We can also have various approaches for the same use case, such as using a different LLM, and compare the various approaches.

The evaluation dashboard shows the various metrics set up for the evaluation of different AI use cases. Here we have four Rouge values as well as a violation value. Each Rouge value represents the average F1-score for the corresponding Rouge type. In addition to Rouge, there are further evaluation metrics such as SARI, METEOR, BLEU, etc.

For each metric, the AI engineer or data scientist can drill down to the transaction level and understand why a certain evaluation failed for that transaction. They can view the score distribution across the evaluated data and download the calculated metrics at the record level.

In the end, model evaluation is part of a bigger story of AI governance. IBM AI Governance gives a consolidated view across various AI projects for

  • Risk management
  • Lifecycle governance
  • Evaluation and monitoring

and serves users such as

  • Data scientists, prompt engineers, AI engineers : users who deal with lower-level metrics
  • ModelOps engineers
  • Heads of enterprise data platforms, risk and compliance managers

Resources

IBM Watsonx.Governance : https://www.ibm.com/products/watsonx-governance

Rouge by Google : https://github.com/google-research/google-research/tree/master/rouge

Rouge by Hugging Face : https://huggingface.co/spaces/evaluate-metric/rouge

Supported evaluation metrics by Watsonx.governance : https://dataplatform.cloud.ibm.com/docs/content/wsj/model/wos-monitor-gen-quality.html?context=cpdaas

Github repository : https://github.com/ijgitsh/rougeExperimental
