
Commit 09e56d0

prvenk, julia-meshcheryakova, shanepeckham, ritesh-modi, tanya-borisova and others authored
708 replace fuzzywuzzy and textdistance with rapidfuzz for plain evaluation metrics (#709)
In this PR, we modify the following:

- Replace the fuzz token set ratio from fuzzywuzzy with the much faster rapidfuzz version. We also make this more flexible with an option to measure other kinds of fuzz metrics, namely: ratio, partial_ratio, token_sort_ratio, token_sort_partial_ratio, and token_set_partial_ratio.
- Replace some plain metrics from textdistance with the rapidfuzz versions (these are faster, as documented in the issue). The metrics we replace are lcsseq, hamming, jaro_winkler, and levenshtein. We retain the textdistance variants for cosine and jaccard, given rapidfuzz doesn't have a version of them.
- Add rouge scores as plain string metrics.
- Rename variables for clarity, e.g., doc1 to str1 and value1 to str1, given all inputs are supposed to be strings and these names were not consistent.
- Add types for arguments and return values.
- Edit documentation to reflect the changes.
- Add tests.

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Julia Meshcheryakova <juliame@microsoft.com>
Co-authored-by: Shane Peckham <shanepeckham@live.com>
Co-authored-by: ritesh.modi <rimod@microsoft.com>
Co-authored-by: Tanya Borisova <tborisova@microsoft.com>
Co-authored-by: Ross Smith <ross-p-smith@users.noreply.github.com>
Co-authored-by: Guy Bertental <gubert@microsoft.com>
Co-authored-by: Martin Peck <martinjohnpeck@gmail.com>
Co-authored-by: Guy Bertental <guybartal@gmail.com>
Co-authored-by: Yuval Yaron <yuvalyaron@microsoft.com>
Co-authored-by: Liza Shakury <Liza.Shakury@microsoft.com>
Co-authored-by: Yuval Yaron <43217306+yuvalyaron@users.noreply.github.com>
Co-authored-by: Liza Shakury <42377481+LizaShak@users.noreply.github.com>
Co-authored-by: LizaShak <iliza@outlook.com>
Co-authored-by: Shivam Kumar Singh <shivamhere247@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Ikko Eltociear Ashimine <eltociear@gmail.com>
Co-authored-by: Karina Cortiñas <79872984+kcortinas@users.noreply.github.com>
Co-authored-by: jedheaj314 <51018779+jedheaj314@users.noreply.github.com>
Co-authored-by: Vadim Kirilin <vadimkirilin@microsoft.com>
Co-authored-by: Vadim Kirilin <vadimkirilin@Vadims-MacBook-Pro.local>
Co-authored-by: Matt Luker <luker.matt@gmail.com>
Co-authored-by: ScoGroMSFT <scogro@microsoft.com>
Co-authored-by: Liam Moat <contact@liammoat.com>
Co-authored-by: Liam Moat <liam.moat@microsoft.com>
Co-authored-by: Thomas Conté <tconte@microsoft.com>
1 parent 290bd0e commit 09e56d0
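
For reference, the rapidfuzz fuzz variants named in the commit message can be computed as in the sketch below. This is a minimal illustration rather than code from this repository: the wrapper `fuzzy_score_example` and the dictionary of variant names are assumptions, while the `rapidfuzz.fuzz` functions are the library's own API (the message's token_sort_partial_ratio and token_set_partial_ratio presumably map to rapidfuzz's `partial_token_sort_ratio` and `partial_token_set_ratio`).

```python
# Minimal sketch (not repository code): computing the rapidfuzz "fuzz" scores
# mentioned in the commit message. All functions return a score in [0, 100].
from rapidfuzz import fuzz

FUZZ_VARIANTS = {
    "ratio": fuzz.ratio,
    "partial_ratio": fuzz.partial_ratio,
    "token_sort_ratio": fuzz.token_sort_ratio,
    "token_set_ratio": fuzz.token_set_ratio,  # default match type used for fuzzy_score
    "partial_token_sort_ratio": fuzz.partial_token_sort_ratio,
    "partial_token_set_ratio": fuzz.partial_token_set_ratio,
}

def fuzzy_score_example(str1: str, str2: str, match_type: str = "token_set_ratio") -> float:
    """Illustrative wrapper: pick one of the fuzz variants and score two strings."""
    return FUZZ_VARIANTS[match_type](str1, str2)

if __name__ == "__main__":
    # Token-set comparison ignores word order, so this prints 100.0.
    print(fuzzy_score_example("new york mets vs atlanta braves",
                              "atlanta braves vs new york mets"))
```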

File tree

11 files changed: +230 −88 lines changed

.github/workflows/config.json

Lines changed: 3 additions & 2 deletions

```diff
@@ -75,9 +75,10 @@
   },
   "eval": {
     "metric_types": [
-      "fuzzy",
+      "fuzzy_score",
+      "cosine_ochiai",
+      "rouge2_recall",
       "bert_all_MiniLM_L6_v2",
-      "cosine",
       "bert_distilbert_base_nli_stsb_mean_tokens",
       "llm_answer_relevance",
       "llm_context_precision"
```

README.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -351,7 +351,7 @@ Every array will produce the combinations of flat configurations when the method
     "temperature": "determines the OpenAI temperature. Valid value ranges from 0 to 1."
   },
   "eval": {
-    "metric_types": "determines the metrics used for evaluation (end-to-end or component-wise metrics using LLMs). Valid values for end-to-end metrics are lcsstr, lcsseq, cosine, jaro_winkler, hamming, jaccard, levenshtein, fuzzy, bert_all_MiniLM_L6_v2, bert_base_nli_mean_tokens, bert_large_nli_mean_tokens, bert_large_nli_stsb_mean_tokens, bert_distilbert_base_nli_stsb_mean_tokens, bert_paraphrase_multilingual_MiniLM_L12_v2. Valid values for component-wise LLM-based metrics are llm_answer_relevance, llm_context_precision and llm_context_recall. e.g ['fuzzy','bert_all_MiniLM_L6_v2','cosine','bert_distilbert_base_nli_stsb_mean_tokens', 'llm_answer_relevance']",
+    "metric_types": "determines the metrics used for evaluation (end-to-end or component-wise metrics using LLMs). Valid values for end-to-end metrics are lcsstr, lcsseq, cosine, jaro_winkler, hamming, jaccard, levenshtein, fuzzy_score, cosine_ochiai, bert_all_MiniLM_L6_v2, bert_base_nli_mean_tokens, bert_large_nli_mean_tokens, bert_large_nli_stsb_mean_tokens, bert_distilbert_base_nli_stsb_mean_tokens, bert_paraphrase_multilingual_MiniLM_L12_v2. Valid values for component-wise LLM-based metrics are llm_answer_relevance, llm_context_precision and llm_context_recall. e.g ['fuzzy_score','bert_all_MiniLM_L6_v2','cosine_ochiai','bert_distilbert_base_nli_stsb_mean_tokens', 'llm_answer_relevance']",
   }
 }
 ```
````

config.sample.json

Lines changed: 2 additions & 2 deletions

```diff
@@ -76,9 +76,9 @@
   },
   "eval": {
     "metric_types": [
-      "fuzzy",
+      "fuzzy_score",
       "bert_all_MiniLM_L6_v2",
-      "cosine",
+      "cosine_ochiai",
       "bert_distilbert_base_nli_stsb_mean_tokens",
       "llm_answer_relevance",
       "llm_context_precision"
```

config.schema.json

Lines changed: 11 additions & 2 deletions

```diff
@@ -556,12 +556,21 @@
   "enum": [
     "lcsstr",
     "lcsseq",
-    "cosine",
     "jaro_winkler",
     "hamming",
     "jaccard",
     "levenshtein",
-    "fuzzy",
+    "fuzzy_score",
+    "cosine_ochiai",
+    "rouge1_precision",
+    "rouge1_recall",
+    "rouge1_fmeasure",
+    "rouge2_precision",
+    "rouge2_recall",
+    "rouge2_fmeasure",
+    "rougeL_precision",
+    "rougeL_recall",
+    "rougeL_fmeasure",
     "bert_all_MiniLM_L6_v2",
     "bert_base_nli_mean_tokens",
     "bert_large_nli_mean_tokens",
```

docs/evaluation-metrics.md

Lines changed: 32 additions & 4 deletions

```diff
@@ -21,7 +21,16 @@ You can choose which metrics should be calculated in your experiment by updating
   "hamming",
   "jaccard",
   "levenshtein",
-  "fuzzy",
+  "fuzzy_score",
+  "rouge1_precision",
+  "rouge1_recall",
+  "rouge1_fmeasure",
+  "rouge2_precision",
+  "rouge2_recall",
+  "rouge2_fmeasure",
+  "rougeL_precision",
+  "rougeL_recall",
+  "rougeL_fmeasure",
   "bert_all_MiniLM_L6_v2",
   "bert_base_nli_mean_tokens",
   "bert_large_nli_mean_tokens",
@@ -100,15 +109,34 @@ of elements in the union of the two sets.
 The Levenshtein distance is a measure of similarity between two strings. The Levenshtein distance is calculated as the
 minimum number of insertions, deletions, or substitutions required to transform one string into the other.
 
-### FuzzyWuzzy similarity
+### RapidFuzz similarity
 
 | Configuration Key | Calculation Base | Possible Values |
 | ----------------- | -------------------- | --------------------- |
-| `fuzzy` | `actual`, `expected` | Integer (Fuzzy score) |
+| `fuzzy_score` | `actual`, `expected` | Percentage (0 - 100) |
 
-This metric is backed by the [FuzzyWuzzy Python package](https://pypi.org/project/fuzzywuzzy/).
+This metric is backed by the [RapidFuzz Python package](https://github.com/rapidfuzz/RapidFuzz).
 Calculates the fuzzy score between two documents using the levenshtein distance.
 
+### Rouge retrieval metrics (Token based)
+
+**Rouge**, short for Recall-Oriented Understudy for Gisting Evaluation, is typically used in summarization evaluation tasks, comparing human-produced references and system-generated summaries. The core idea is to compare and validate sufficient overlap of common words or phrases in both reference and prediction. String metrics look at character-level differences, whereas Rouge can help us compare token-level matches. We use the [`rouge-score`](https://pypi.org/project/rouge-score/) package to compute these measures. Here are some of the metrics we capture.
+
+| Configuration Key | Calculation Base | Possible Values |
+| -------------------------------------------- | ---------------------------- | --------------------- |
+| `rouge{1 \| 2 \| L}_{precision \| recall \| fmeasure}` | `ground_truth`, `prediction` | Percentage (0 - 100) |
+
+- **rouge1_precision**: The ROUGE-1 precision score is the number of overlapping unigrams between the predicted and ground_truth strings divided by the number of unigrams in the ground_truth string.
+- **rouge1_recall**: The ROUGE-1 recall score is the number of overlapping unigrams between the predicted and ground_truth strings divided by the number of unigrams in the predicted string.
+- **rouge1_fmeasure**: This is the harmonic mean of the ROUGE-1 precision and recall scores.
+- **rouge2_precision**: The ROUGE-2 precision score is the number of overlapping bigrams between the predicted and ground_truth strings divided by the number of bigrams in the ground_truth string.
+- **rouge2_recall**: The ROUGE-2 recall score is the number of overlapping bigrams between the predicted and ground_truth strings divided by the number of bigrams in the predicted string.
+- **rouge2_fmeasure**: This is the harmonic mean of the ROUGE-2 precision and recall scores.
+- **rougeL_precision**: The ROUGE-L precision score is the length of the overlapping longest common subsequence between the predicted and ground_truth strings divided by the number of unigrams in the predicted string.
+- **rougeL_recall**: The ROUGE-L recall score is the length of the overlapping longest common subsequence between the predicted and ground_truth strings divided by the number of unigrams in the ground_truth string.
+- **rougeL_fmeasure**: This is the harmonic mean of the ROUGE-L precision and recall scores.
+
 ## BERT-based semantic similarity
 
 The following set of metrics calculates semantic similarity between two strings as percentage of differences based on
```
promptflow/rag-experiment-accelerator/README.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -117,7 +117,7 @@ az ml environment create --file ./environment.yaml -w $MLWorkSpaceName
   "cross_encoder_model" :"determines the model used for cross-encoding re-ranking step. Valid value is cross-encoder/stsb-roberta-base",
   "search_types" : "determines the search types used for experimentation. Valid value are search_for_match_semantic, search_for_match_Hybrid_multi, search_for_match_Hybrid_cross, search_for_match_text, search_for_match_pure_vector, search_for_match_pure_vector_multi, search_for_match_pure_vector_cross, search_for_manual_hybrid. e.g. ['search_for_manual_hybrid', 'search_for_match_Hybrid_multi','search_for_match_semantic' ]",
   "retrieve_num_of_documents": "determines the number of chunks to retrieve from the search index",
-  "metric_types" : "determines the metrics used for evaluation purpose. Valid value are lcsstr, lcsseq, cosine, jaro_winkler, hamming, jaccard, levenshtein, fuzzy, bert_all_MiniLM_L6_v2, bert_base_nli_mean_tokens, bert_large_nli_mean_tokens, bert_large_nli_stsb_mean_tokens, bert_distilbert_base_nli_stsb_mean_tokens, bert_paraphrase_multilingual_MiniLM_L12_v2 llm_context_precision, llm_answer_relevance. e.g ['fuzzy','bert_all_MiniLM_L6_v2','cosine','bert_distilbert_base_nli_stsb_mean_tokens']",
+  "metric_types" : "determines the metrics used for evaluation purpose. Valid value are lcsstr, lcsseq, cosine, jaro_winkler, hamming, jaccard, levenshtein, fuzzy_score, cosine_ochiai, bert_all_MiniLM_L6_v2, bert_base_nli_mean_tokens, bert_large_nli_mean_tokens, bert_large_nli_stsb_mean_tokens, bert_distilbert_base_nli_stsb_mean_tokens, bert_paraphrase_multilingual_MiniLM_L12_v2, llm_context_precision, llm_answer_relevance. e.g ['fuzzy_score','bert_all_MiniLM_L6_v2','cosine_ochiai','bert_distilbert_base_nli_stsb_mean_tokens']",
   "azure_oai_chat_deployment_name": "determines the Azure OpenAI chat deployment name",
   "azure_oai_eval_deployment_name": "determines the Azure OpenAI evaluation deployment name",
   "embedding_model_name": "embedding model name",
```

rag_experiment_accelerator/config/eval_config.py

Lines changed: 2 additions & 2 deletions

```diff
@@ -6,9 +6,9 @@
 class EvalConfig(BaseConfig):
     metric_types: list[str] = field(
         default_factory=lambda: [
-            "fuzzy",
+            "fuzzy_score",
             "bert_all_MiniLM_L6_v2",
-            "cosine",
+            "cosine_ochiai",
             "bert_distilbert_base_nli_stsb_mean_tokens",
             "llm_answer_relevance",
             "llm_context_precision",
```

rag_experiment_accelerator/evaluation/eval.py

Lines changed: 27 additions & 5 deletions

```diff
@@ -56,12 +56,31 @@ def compute_metrics(
         metric_type (str): The type of metric to use for comparison. Valid options are:
             - "lcsstr": Longest common substring
             - "lcsseq": Longest common subsequence
-            - "cosine": Cosine similarity (Ochiai coefficient)
             - "jaro_winkler": Jaro-Winkler distance
             - "hamming": Hamming distance
             - "jaccard": Jaccard similarity
             - "levenshtein": Levenshtein distance
-            - "fuzzy": FuzzyWuzzy similarity
+            - "fuzzy_score": RapidFuzz similarity. This is faster than the associated function in FuzzyWuzzy.
+              Default match type is "token_set_ratio".
+            - "cosine_ochiai": Cosine similarity (Ochiai coefficient)
+            - "rouge1_precision": The ROUGE-1 precision score. This is the number of overlapping unigrams
+              between the actual and expected strings divided by the number of unigrams
+              in the expected string.
+            - "rouge1_recall": The ROUGE-1 recall score. This is the number of overlapping unigrams between
+              the actual and expected strings divided by the number of unigrams in the actual string.
+            - "rouge1_fmeasure": ROUGE-1 F1 score. This is the harmonic mean of the ROUGE-1 precision and recall scores.
+            - "rouge2_precision": The ROUGE-2 precision score. This is the number of overlapping bigrams between
+              the actual and expected strings divided by the number of bigrams in the expected string.
+            - "rouge2_recall": The ROUGE-2 recall score. This is the number of overlapping bigrams between the actual
+              and expected strings divided by the number of bigrams in the actual string.
+            - "rouge2_fmeasure": ROUGE-2 F1 score. This is the harmonic mean of the ROUGE-2 precision and recall scores.
+            - "rougeL_precision": The ROUGE-L precision score is the length of overlapping longest common subsequence
+              between the actual and expected strings divided by the number of unigrams
+              in the predicted string.
+            - "rougeL_recall": The ROUGE-L recall score is the length of overlapping longest common subsequence
+              between the actual and expected strings divided by the number of unigrams in the
+              actual string.
+            - "rougeL_fmeasure": ROUGE-L F1 score. This is the harmonic mean of the ROUGE-L precision and recall scores.
             - "bert_all_MiniLM_L6_v2": BERT-based semantic similarity (MiniLM L6 v2 model)
             - "bert_base_nli_mean_tokens": BERT-based semantic similarity (base model, mean tokens)
             - "bert_large_nli_mean_tokens": BERT-based semantic similarity (large model, mean tokens)
@@ -82,9 +101,12 @@ def compute_metrics(
         float: The similarity score between the two strings, as determined by the specified metric.
     """
 
-    plain_metric_func = getattr(plain_metrics, metric_type, None)
-    if plain_metric_func:
-        return plain_metric_func(actual, expected)
+    if metric_type.startswith("rouge"):
+        return plain_metrics.rouge_score(ground_truth=expected, prediction=actual, rouge_metric_name=metric_type)
+    else:
+        plain_metric_func = getattr(plain_metrics, metric_type, None)
+        if plain_metric_func:
+            return plain_metric_func(actual, expected)
 
     try:
         score = compute_transformer_based_score(actual, expected, metric_type)
```
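
The dispatch above calls into a `plain_metrics` module whose string metrics are now backed by rapidfuzz. The sketch below illustrates what the swapped metrics (levenshtein, hamming, jaro_winkler, lcsseq) can look like on top of `rapidfuzz.distance`; the function bodies are assumptions and the repository's real implementations may scale or round results differently, while the rapidfuzz modules and methods exist as shown.

```python
# Minimal sketch (assumed, not repository code): plain string metrics backed by
# rapidfuzz.distance, mirroring the textdistance functions this commit replaces.
from rapidfuzz.distance import Hamming, JaroWinkler, LCSseq, Levenshtein

def levenshtein(str1: str, str2: str) -> float:
    # Normalized similarity in [0, 1]; 1.0 means the strings are identical.
    return Levenshtein.normalized_similarity(str1, str2)

def hamming(str1: str, str2: str) -> float:
    # Recent rapidfuzz versions pad the shorter string by default (pad=True),
    # so strings of unequal length are accepted.
    return Hamming.normalized_similarity(str1, str2)

def jaro_winkler(str1: str, str2: str) -> float:
    return JaroWinkler.similarity(str1, str2)

def lcsseq(str1: str, str2: str) -> float:
    return LCSseq.normalized_similarity(str1, str2)
```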
