
Commit 09e56d0

prvenk, julia-meshcheryakova, shanepeckham, ritesh-modi, tanya-borisova and others authored
708 replace fuzzywuzzy and textdistance with rapidfuzz for plain evaluation metrics (#709)
In this PR, we modify the following:

- Replace the fuzz token set ratio from fuzzywuzzy with the much faster rapidfuzz version. We also make this more flexible with an option to measure other kinds of fuzz metrics, namely: ratio, partial_ratio, token_sort_ratio, token_sort_partial_ratio, and token_set_partial_ratio.
- Replace some plain metrics from textdistance with the rapidfuzz versions (these are faster, as documented in the issue). The metrics we replace are lcsseq, hamming, jaro_winkler, and levenshtein. We retain the textdistance variants for cosine and jaccard, given rapidfuzz doesn't have a version of them.
- Add rouge scores as plain string metrics.
- Rename variables for clarity, e.g., doc1 to str1 and value1 to str1, given all inputs are supposed to be strings and these names were not consistent.
- Add types for arguments and return values.
- Edit documentation to reflect the changes.
- Add tests.

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Julia Meshcheryakova <juliame@microsoft.com>
Co-authored-by: Shane Peckham <shanepeckham@live.com>
Co-authored-by: ritesh.modi <rimod@microsoft.com>
Co-authored-by: Tanya Borisova <tborisova@microsoft.com>
Co-authored-by: Ross Smith <ross-p-smith@users.noreply.github.com>
Co-authored-by: Guy Bertental <gubert@microsoft.com>
Co-authored-by: Martin Peck <martinjohnpeck@gmail.com>
Co-authored-by: Guy Bertental <guybartal@gmail.com>
Co-authored-by: Yuval Yaron <yuvalyaron@microsoft.com>
Co-authored-by: Liza Shakury <Liza.Shakury@microsoft.com>
Co-authored-by: Yuval Yaron <43217306+yuvalyaron@users.noreply.github.com>
Co-authored-by: Liza Shakury <42377481+LizaShak@users.noreply.github.com>
Co-authored-by: LizaShak <iliza@outlook.com>
Co-authored-by: Shivam Kumar Singh <shivamhere247@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Ikko Eltociear Ashimine <eltociear@gmail.com>
Co-authored-by: Karina Cortiñas <79872984+kcortinas@users.noreply.github.com>
Co-authored-by: jedheaj314 <51018779+jedheaj314@users.noreply.github.com>
Co-authored-by: Vadim Kirilin <vadimkirilin@microsoft.com>
Co-authored-by: Vadim Kirilin <vadimkirilin@Vadims-MacBook-Pro.local>
Co-authored-by: Matt Luker <luker.matt@gmail.com>
Co-authored-by: ScoGroMSFT <scogro@microsoft.com>
Co-authored-by: Liam Moat <contact@liammoat.com>
Co-authored-by: Liam Moat <liam.moat@microsoft.com>
Co-authored-by: Thomas Conté <tconte@microsoft.com>
1 parent 290bd0e commit 09e56d0
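
For reference, the rapidfuzz fuzz variants named in the commit message can be computed as in the sketch below. This is a minimal illustration rather than code from this repository: the wrapper `fuzzy_score_example` and the dictionary of variant names are assumptions, while the `rapidfuzz.fuzz` functions are the library's own API (the message's token_sort_partial_ratio and token_set_partial_ratio presumably map to rapidfuzz's `partial_token_sort_ratio` and `partial_token_set_ratio`).

```python
# Minimal sketch (not repository code): computing the rapidfuzz "fuzz" scores
# mentioned in the commit message. All functions return a score in [0, 100].
from rapidfuzz import fuzz

FUZZ_VARIANTS = {
    "ratio": fuzz.ratio,
    "partial_ratio": fuzz.partial_ratio,
    "token_sort_ratio": fuzz.token_sort_ratio,
    "token_set_ratio": fuzz.token_set_ratio,  # default match type used for fuzzy_score
    "partial_token_sort_ratio": fuzz.partial_token_sort_ratio,
    "partial_token_set_ratio": fuzz.partial_token_set_ratio,
}

def fuzzy_score_example(str1: str, str2: str, match_type: str = "token_set_ratio") -> float:
    """Illustrative wrapper: pick one of the fuzz variants and score two strings."""
    return FUZZ_VARIANTS[match_type](str1, str2)

if __name__ == "__main__":
    # Token-set comparison ignores word order, so this prints 100.0.
    print(fuzzy_score_example("new york mets vs atlanta braves",
                              "atlanta braves vs new york mets"))
```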

File tree

11 files changed: +230 −88 lines changed

.github/workflows/config.json

Lines changed: 3 additions & 2 deletions

```diff
@@ -75,9 +75,10 @@
   },
   "eval": {
     "metric_types": [
-      "fuzzy",
+      "fuzzy_score",
+      "cosine_ochiai",
+      "rouge2_recall",
       "bert_all_MiniLM_L6_v2",
-      "cosine",
       "bert_distilbert_base_nli_stsb_mean_tokens",
       "llm_answer_relevance",
       "llm_context_precision"
```

README.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -351,7 +351,7 @@ Every array will produce the combinations of flat configurations when the method
     "temperature": "determines the OpenAI temperature. Valid value ranges from 0 to 1."
   },
   "eval": {
-    "metric_types": "determines the metrics used for evaluation (end-to-end or component-wise metrics using LLMs). Valid values for end-to-end metrics are lcsstr, lcsseq, cosine, jaro_winkler, hamming, jaccard, levenshtein, fuzzy, bert_all_MiniLM_L6_v2, bert_base_nli_mean_tokens, bert_large_nli_mean_tokens, bert_large_nli_stsb_mean_tokens, bert_distilbert_base_nli_stsb_mean_tokens, bert_paraphrase_multilingual_MiniLM_L12_v2. Valid values for component-wise LLM-based metrics are llm_answer_relevance, llm_context_precision and llm_context_recall. e.g ['fuzzy','bert_all_MiniLM_L6_v2','cosine','bert_distilbert_base_nli_stsb_mean_tokens', 'llm_answer_relevance']",
+    "metric_types": "determines the metrics used for evaluation (end-to-end or component-wise metrics using LLMs). Valid values for end-to-end metrics are lcsstr, lcsseq, cosine, jaro_winkler, hamming, jaccard, levenshtein, fuzzy_score, cosine_ochiai, bert_all_MiniLM_L6_v2, bert_base_nli_mean_tokens, bert_large_nli_mean_tokens, bert_large_nli_stsb_mean_tokens, bert_distilbert_base_nli_stsb_mean_tokens, bert_paraphrase_multilingual_MiniLM_L12_v2. Valid values for component-wise LLM-based metrics are llm_answer_relevance, llm_context_precision and llm_context_recall. e.g ['fuzzy_score','bert_all_MiniLM_L6_v2','cosine_ochiai','bert_distilbert_base_nli_stsb_mean_tokens', 'llm_answer_relevance']",
   }
 }
 ```
````

config.sample.json

Lines changed: 2 additions & 2 deletions

```diff
@@ -76,9 +76,9 @@
   },
   "eval": {
     "metric_types": [
-      "fuzzy",
+      "fuzzy_score",
       "bert_all_MiniLM_L6_v2",
-      "cosine",
+      "cosine_ochiai",
       "bert_distilbert_base_nli_stsb_mean_tokens",
       "llm_answer_relevance",
       "llm_context_precision"
```

config.schema.json

Lines changed: 11 additions & 2 deletions

```diff
@@ -556,12 +556,21 @@
   "enum": [
     "lcsstr",
     "lcsseq",
-    "cosine",
     "jaro_winkler",
     "hamming",
     "jaccard",
     "levenshtein",
-    "fuzzy",
+    "fuzzy_score",
+    "cosine_ochiai",
+    "rouge1_precision",
+    "rouge1_recall",
+    "rouge1_fmeasure",
+    "rouge2_precision",
+    "rouge2_recall",
+    "rouge2_fmeasure",
+    "rougeL_precision",
+    "rougeL_recall",
+    "rougeL_fmeasure",
     "bert_all_MiniLM_L6_v2",
     "bert_base_nli_mean_tokens",
     "bert_large_nli_mean_tokens",
```

docs/evaluation-metrics.md

Lines changed: 32 additions & 4 deletions

```diff
@@ -21,7 +21,16 @@ You can choose which metrics should be calculated in your experiment by updating
   "hamming",
   "jaccard",
   "levenshtein",
-  "fuzzy",
+  "fuzzy_score",
+  "rouge1_precision",
+  "rouge1_recall",
+  "rouge1_fmeasure",
+  "rouge2_precision",
+  "rouge2_recall",
+  "rouge2_fmeasure",
+  "rougeL_precision",
+  "rougeL_recall",
+  "rougeL_fmeasure",
   "bert_all_MiniLM_L6_v2",
   "bert_base_nli_mean_tokens",
   "bert_large_nli_mean_tokens",
@@ -100,15 +109,34 @@ of elements in the union of the two sets.
 The Levenshtein distance is a measure of similarity between two strings. The Levenshtein distance is calculated as the
 minimum number of insertions, deletions, or substitutions required to transform one string into the other.
 
-### FuzzyWuzzy similarity
+### RapidFuzz similarity
 
 | Configuration Key | Calculation Base | Possible Values |
 | ----------------- | -------------------- | --------------------- |
-| `fuzzy` | `actual`, `expected` | Integer (Fuzzy score) |
+| `fuzzy_score` | `actual`, `expected` | Percentage (0 - 100) |
 
-This metric is backed by the [FuzzyWuzzy Python package](https://pypi.org/project/fuzzywuzzy/).
+This metric is backed by the [RapidFuzz Python package](https://github.com/rapidfuzz/RapidFuzz).
 Calculates the fuzzy score between two documents using the levenshtein distance.
 
+### Rouge retrieval metrics (Token based)
+
+**Rouge**, short for Recall-Oriented Understudy for Gisting Evaluation, is typically used in summarization evaluation tasks, comparing human-produced references and system-generated summaries. The core idea is to compare and validate sufficient overlap of common words or phrases in both reference and prediction. String metrics look at character-level differences, whereas Rouge can help us compare token-level matches. We use the [`rouge-score`](https://pypi.org/project/rouge-score/) package to compute these measures. Here are some of the metrics we capture.
+
+| Configuration Key | Calculation Base | Possible Values |
+| -------------------------------------------- | ---------------------------- | --------------------- |
+| `rouge{1 \| 2 \| L}_{precision \| recall \| fmeasure}` | `ground_truth`, `prediction` | Percentage (0 - 100) |
+
+- **rouge1_precision**: The ROUGE-1 precision score is the number of overlapping unigrams between the predicted and ground_truth strings divided by the number of unigrams in the ground_truth string.
+- **rouge1_recall**: The ROUGE-1 recall score is the number of overlapping unigrams between the predicted and ground_truth strings divided by the number of unigrams in the predicted string.
+- **rouge1_fmeasure**: This is the harmonic mean of the ROUGE-1 precision and recall scores.
+- **rouge2_precision**: The ROUGE-2 precision score is the number of overlapping bigrams between the predicted and ground_truth strings divided by the number of bigrams in the ground_truth string.
+- **rouge2_recall**: The ROUGE-2 recall score is the number of overlapping bigrams between the predicted and ground_truth strings divided by the number of bigrams in the predicted string.
+- **rouge2_fmeasure**: This is the harmonic mean of the ROUGE-2 precision and recall scores.
+- **rougeL_precision**: The ROUGE-L precision score is the length of the overlapping longest common subsequence between the predicted and ground_truth strings divided by the number of unigrams in the predicted string.
+- **rougeL_recall**: The ROUGE-L recall score is the length of the overlapping longest common subsequence between the predicted and ground_truth strings divided by the number of unigrams in the ground_truth string.
+- **rougeL_fmeasure**: This is the harmonic mean of the ROUGE-L precision and recall scores.
+
 ## BERT-based semantic similarity
 
 The following set of metrics calculates semantic similarity between two strings as percentage of differences based on
```
promptflow/rag-experiment-accelerator/README.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -117,7 +117,7 @@ az ml environment create --file ./environment.yaml -w $MLWorkSpaceName
   "cross_encoder_model" :"determines the model used for cross-encoding re-ranking step. Valid value is cross-encoder/stsb-roberta-base",
   "search_types" : "determines the search types used for experimentation. Valid value are search_for_match_semantic, search_for_match_Hybrid_multi, search_for_match_Hybrid_cross, search_for_match_text, search_for_match_pure_vector, search_for_match_pure_vector_multi, search_for_match_pure_vector_cross, search_for_manual_hybrid. e.g. ['search_for_manual_hybrid', 'search_for_match_Hybrid_multi','search_for_match_semantic' ]",
   "retrieve_num_of_documents": "determines the number of chunks to retrieve from the search index",
-  "metric_types" : "determines the metrics used for evaluation purpose. Valid value are lcsstr, lcsseq, cosine, jaro_winkler, hamming, jaccard, levenshtein, fuzzy, bert_all_MiniLM_L6_v2, bert_base_nli_mean_tokens, bert_large_nli_mean_tokens, bert_large_nli_stsb_mean_tokens, bert_distilbert_base_nli_stsb_mean_tokens, bert_paraphrase_multilingual_MiniLM_L12_v2 llm_context_precision, llm_answer_relevance. e.g ['fuzzy','bert_all_MiniLM_L6_v2','cosine','bert_distilbert_base_nli_stsb_mean_tokens']",
+  "metric_types" : "determines the metrics used for evaluation purpose. Valid value are lcsstr, lcsseq, cosine, jaro_winkler, hamming, jaccard, levenshtein, fuzzy_score, cosine_ochiai, bert_all_MiniLM_L6_v2, bert_base_nli_mean_tokens, bert_large_nli_mean_tokens, bert_large_nli_stsb_mean_tokens, bert_distilbert_base_nli_stsb_mean_tokens, bert_paraphrase_multilingual_MiniLM_L12_v2, llm_context_precision, llm_answer_relevance. e.g ['fuzzy_score','bert_all_MiniLM_L6_v2','cosine_ochiai','bert_distilbert_base_nli_stsb_mean_tokens']",
   "azure_oai_chat_deployment_name": "determines the Azure OpenAI chat deployment name",
   "azure_oai_eval_deployment_name": "determines the Azure OpenAI evaluation deployment name",
   "embedding_model_name": "embedding model name",
```

rag_experiment_accelerator/config/eval_config.py

Lines changed: 2 additions & 2 deletions

```diff
@@ -6,9 +6,9 @@
 class EvalConfig(BaseConfig):
     metric_types: list[str] = field(
         default_factory=lambda: [
-            "fuzzy",
+            "fuzzy_score",
             "bert_all_MiniLM_L6_v2",
-            "cosine",
+            "cosine_ochiai",
             "bert_distilbert_base_nli_stsb_mean_tokens",
             "llm_answer_relevance",
             "llm_context_precision",
```

rag_experiment_accelerator/evaluation/eval.py

Lines changed: 27 additions & 5 deletions

```diff
@@ -56,12 +56,31 @@ def compute_metrics(
         metric_type (str): The type of metric to use for comparison. Valid options are:
             - "lcsstr": Longest common substring
             - "lcsseq": Longest common subsequence
-            - "cosine": Cosine similarity (Ochiai coefficient)
             - "jaro_winkler": Jaro-Winkler distance
             - "hamming": Hamming distance
             - "jaccard": Jaccard similarity
             - "levenshtein": Levenshtein distance
-            - "fuzzy": FuzzyWuzzy similarity
+            - "fuzzy_score": RapidFuzz similarity. This is faster than the associated function in FuzzyWuzzy.
+              Default match type is "token_set_ratio".
+            - "cosine_ochiai": Cosine similarity (Ochiai coefficient)
+            - "rouge1_precision": The ROUGE-1 precision score. This is the number of overlapping unigrams
+              between the actual and expected strings divided by the number of unigrams
+              in the expected string.
+            - "rouge1_recall": The ROUGE-1 recall score. This is the number of overlapping unigrams between
+              the actual and expected strings divided by the number of unigrams in the actual string.
+            - "rouge1_fmeasure": ROUGE-1 F1 score. This is the harmonic mean of the ROUGE-1 precision and recall scores.
+            - "rouge2_precision": The ROUGE-2 precision score. This is the number of overlapping bigrams between
+              the actual and expected strings divided by the number of bigrams in the expected string.
+            - "rouge2_recall": The ROUGE-2 recall score. This is the number of overlapping bigrams between the actual
+              and expected strings divided by the number of bigrams in the actual string.
+            - "rouge2_fmeasure": ROUGE-2 F1 score. This is the harmonic mean of the ROUGE-2 precision and recall scores.
+            - "rougeL_precision": The ROUGE-L precision score is the length of overlapping longest common subsequence
+              between the actual and expected strings divided by the number of unigrams
+              in the predicted string.
+            - "rougeL_recall": The ROUGE-L recall score is the length of overlapping longest common subsequence
+              between the actual and expected strings divided by the number of unigrams in the
+              actual string.
+            - "rougeL_fmeasure": ROUGE-L F1 score. This is the harmonic mean of the ROUGE-L precision and recall scores.
             - "bert_all_MiniLM_L6_v2": BERT-based semantic similarity (MiniLM L6 v2 model)
             - "bert_base_nli_mean_tokens": BERT-based semantic similarity (base model, mean tokens)
             - "bert_large_nli_mean_tokens": BERT-based semantic similarity (large model, mean tokens)
@@ -82,9 +101,12 @@ def compute_metrics(
         float: The similarity score between the two strings, as determined by the specified metric.
     """
 
-    plain_metric_func = getattr(plain_metrics, metric_type, None)
-    if plain_metric_func:
-        return plain_metric_func(actual, expected)
+    if metric_type.startswith("rouge"):
+        return plain_metrics.rouge_score(ground_truth=expected, prediction=actual, rouge_metric_name=metric_type)
+    else:
+        plain_metric_func = getattr(plain_metrics, metric_type, None)
+        if plain_metric_func:
+            return plain_metric_func(actual, expected)
 
     try:
         score = compute_transformer_based_score(actual, expected, metric_type)
```
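
The dispatch above calls into a `plain_metrics` module whose string metrics are now backed by rapidfuzz. The sketch below illustrates what the swapped metrics (levenshtein, hamming, jaro_winkler, lcsseq) can look like on top of `rapidfuzz.distance`; the function bodies are assumptions and the repository's real implementations may scale or round results differently, while the rapidfuzz modules and methods exist as shown.

```python
# Minimal sketch (assumed, not repository code): plain string metrics backed by
# rapidfuzz.distance, mirroring the textdistance functions this commit replaces.
from rapidfuzz.distance import Hamming, JaroWinkler, LCSseq, Levenshtein

def levenshtein(str1: str, str2: str) -> float:
    # Normalized similarity in [0, 1]; 1.0 means the strings are identical.
    return Levenshtein.normalized_similarity(str1, str2)

def hamming(str1: str, str2: str) -> float:
    # Recent rapidfuzz versions pad the shorter string by default (pad=True),
    # so strings of unequal length are accepted.
    return Hamming.normalized_similarity(str1, str2)

def jaro_winkler(str1: str, str2: str) -> float:
    return JaroWinkler.similarity(str1, str2)

def lcsseq(str1: str, str2: str) -> float:
    return LCSseq.normalized_similarity(str1, str2)
```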
