708 replace fuzzywuzzy and textdistance with rapidfuzz for plain evaluation metrics (#709)
In this PR, we modify the following:
- Replace the fuzz token set ratio from fuzzywuzzy with the much faster rapidfuzz version. We also make this more flexible with an option to measure other kinds of fuzz metrics, namely ratio, partial ratio, token_sort_ratio, token_sort_partial_ratio, and token_set_partial_ratio (see the sketch after this list).
- Replace some plain metrics from textdistance with their rapidfuzz versions (which are faster, as documented in the issue). The metrics we replace are lcsseq, hamming, jarowinkler, and levenshtein. We retain the textdistance variants for cosine and jaccard, since rapidfuzz does not provide them.
- Add ROUGE scores as plain string metrics.
- Rename variables for clarity (e.g., doc1 to str1 and value1 to str1), since all inputs are supposed to be strings and the existing names were not consistent.
- Add type annotations for arguments and return values.
- Update the documentation to reflect the changes.
- Add tests.
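
The following is a rough, illustrative sketch of the swap described in the first two bullets, assuming rapidfuzz's public `fuzz` and `distance` modules; the option names and wrapper code in this repo may differ.

```python
# Illustrative sketch only (not the repo's actual evaluation code):
# computing the plain string metrics above with rapidfuzz.
from rapidfuzz import fuzz
from rapidfuzz.distance import Hamming, JaroWinkler, LCSseq, Levenshtein

str1 = "the quick brown fox jumps over the lazy dog"
str2 = "the quick brown dog jumps over the lazy fox"

# fuzzywuzzy's token_set_ratio has a drop-in rapidfuzz equivalent (0-100 scale),
# alongside the other fuzz variants mentioned above.
print(fuzz.token_set_ratio(str1, str2))
print(fuzz.ratio(str1, str2))
print(fuzz.partial_ratio(str1, str2))
print(fuzz.token_sort_ratio(str1, str2))

# textdistance-style metrics replaced with rapidfuzz.distance (0-1 scale).
print(LCSseq.normalized_similarity(str1, str2))
print(Hamming.normalized_similarity(str1, str2))   # equal-length strings here
print(JaroWinkler.normalized_similarity(str1, str2))
print(Levenshtein.normalized_similarity(str1, str2))
```

Cosine and jaccard stay on textdistance, as noted above, since rapidfuzz does not ship those token-set similarities.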
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Julia Meshcheryakova <juliame@microsoft.com>
Co-authored-by: Shane Peckham <shanepeckham@live.com>
Co-authored-by: ritesh.modi <rimod@microsoft.com>
Co-authored-by: Tanya Borisova <tborisova@microsoft.com>
Co-authored-by: Ross Smith <ross-p-smith@users.noreply.github.com>
Co-authored-by: Guy Bertental <gubert@microsoft.com>
Co-authored-by: Martin Peck <martinjohnpeck@gmail.com>
Co-authored-by: Guy Bertental <guybartal@gmail.com>
Co-authored-by: Yuval Yaron <yuvalyaron@microsoft.com>
Co-authored-by: Liza Shakury <Liza.Shakury@microsoft.com>
Co-authored-by: Yuval Yaron <43217306+yuvalyaron@users.noreply.github.com>
Co-authored-by: Liza Shakury <42377481+LizaShak@users.noreply.github.com>
Co-authored-by: LizaShak <iliza@outlook.com>
Co-authored-by: Shivam Kumar Singh <shivamhere247@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Ikko Eltociear Ashimine <eltociear@gmail.com>
Co-authored-by: Karina Cortiñas <79872984+kcortinas@users.noreply.github.com>
Co-authored-by: jedheaj314 <51018779+jedheaj314@users.noreply.github.com>
Co-authored-by: Vadim Kirilin <vadimkirilin@microsoft.com>
Co-authored-by: Vadim Kirilin <vadimkirilin@Vadims-MacBook-Pro.local>
Co-authored-by: Matt Luker <luker.matt@gmail.com>
Co-authored-by: ScoGroMSFT <scogro@microsoft.com>
Co-authored-by: Liam Moat <contact@liammoat.com>
Co-authored-by: Liam Moat <liam.moat@microsoft.com>
Co-authored-by: Thomas Conté <tconte@microsoft.com>
README.md (+1, −1)
@@ -351,7 +351,7 @@ Every array will produce the combinations of flat configurations when the method
    "temperature": "determines the OpenAI temperature. Valid value ranges from 0 to 1."
  },
  "eval": {
-   "metric_types": "determines the metrics used for evaluation (end-to-end or component-wise metrics using LLMs). Valid values for end-to-end metrics are lcsstr, lcsseq, cosine, jaro_winkler, hamming, jaccard, levenshtein, fuzzy, bert_all_MiniLM_L6_v2, bert_base_nli_mean_tokens, bert_large_nli_mean_tokens, bert_large_nli_stsb_mean_tokens, bert_distilbert_base_nli_stsb_mean_tokens, bert_paraphrase_multilingual_MiniLM_L12_v2. Valid values for component-wise LLM-based metrics are llm_answer_relevance, llm_context_precision and llm_context_recall. e.g ['fuzzy','bert_all_MiniLM_L6_v2','cosine','bert_distilbert_base_nli_stsb_mean_tokens', 'llm_answer_relevance']",
+   "metric_types": "determines the metrics used for evaluation (end-to-end or component-wise metrics using LLMs). Valid values for end-to-end metrics are lcsstr, lcsseq, cosine, jaro_winkler, hamming, jaccard, levenshtein, fuzzy_score, cosine_ochiai, bert_all_MiniLM_L6_v2, bert_base_nli_mean_tokens, bert_large_nli_mean_tokens, bert_large_nli_stsb_mean_tokens, bert_distilbert_base_nli_stsb_mean_tokens, bert_paraphrase_multilingual_MiniLM_L12_v2. Valid values for component-wise LLM-based metrics are llm_answer_relevance, llm_context_precision and llm_context_recall. e.g ['fuzzy_score','bert_all_MiniLM_L6_v2','cosine_ochiai','bert_distilbert_base_nli_stsb_mean_tokens', 'llm_answer_relevance']",
- This metric is backed by the [FuzzyWuzzy Python package](https://pypi.org/project/fuzzywuzzy/).
+ This metric is backed by the [RapidFuzz Python package](https://github.com/rapidfuzz/RapidFuzz).
  Calculates the fuzzy score between two documents using the levenshtein distance.

+ ### Rouge retrieval metrics (Token based)
+
+ **Rouge**, short for Recall-Oriented Understudy for Gisting Evaluation, is typically used in summarization evaluation tasks, comparing human-produced references and system-generated summaries. The core idea is to compare and validate sufficient overlap of common words or phrases in both reference and prediction. String metrics look at character-level differences, whereas Rouge can help us compare token-level matches. We use the [`rouge-score`](https://pypi.org/project/rouge-score/) package to compute these measures. Here are some of the metrics we capture.
+
+ | Configuration Key | Calculation Base | Possible Values |
+ - **rouge1_precision**: The ROUGE-1 precision score is the number of overlapping unigrams between the predicted and ground_truth strings divided by the number of unigrams in the predicted string.
+ - **rouge1_recall**: The ROUGE-1 recall score is the number of overlapping unigrams between the predicted and ground_truth strings divided by the number of unigrams in the ground_truth string.
+ - **rouge1_fmeasure**: This is the harmonic mean of the ROUGE-1 precision and recall scores.
+ - **rouge2_precision**: The ROUGE-2 precision score is the number of overlapping bigrams between the predicted and ground_truth strings divided by the number of bigrams in the predicted string.
+ - **rouge2_recall**: The ROUGE-2 recall score is the number of overlapping bigrams between the predicted and ground_truth strings divided by the number of bigrams in the ground_truth string.
+ - **rouge2_fmeasure**: This is the harmonic mean of the ROUGE-2 precision and recall scores.
+ - **rougeL_precision**: The ROUGE-L precision score is the length of the longest common subsequence between the predicted and ground_truth strings divided by the number of unigrams in the predicted string.
+ - **rougeL_recall**: The ROUGE-L recall score is the length of the longest common subsequence between the predicted and ground_truth strings divided by the number of unigrams in the ground_truth string.
+ - **rougeL_fmeasure**: This is the harmonic mean of the ROUGE-L precision and recall scores.

  ## BERT-based semantic similarity

  The following set of metrics calculates semantic similarity between two strings as percentage of differences based on
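
As a quick, hedged illustration of how the Rouge values documented in the added section can be produced with the `rouge-score` package (this is not the repo's wrapper code; the argument order follows the package's `score(target, prediction)` signature):

```python
# Illustrative sketch only: ROUGE-1/2/L precision, recall and F-measure
# via the rouge-score package referenced in the section above.
from rouge_score import rouge_scorer

ground_truth = "the cat sat on the mat"
prediction = "the cat lay down on the mat"

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(ground_truth, prediction)  # score(target, prediction)

r1 = scores["rouge1"]
print(r1.precision, r1.recall, r1.fmeasure)
print(scores["rouge2"].fmeasure)
print(scores["rougeL"].fmeasure)
```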
promptflow/rag-experiment-accelerator/README.md (+1, −1)
@@ -117,7 +117,7 @@ az ml environment create --file ./environment.yaml -w $MLWorkSpaceName
  "cross_encoder_model" :"determines the model used for cross-encoding re-ranking step. Valid value is cross-encoder/stsb-roberta-base",
  "search_types" : "determines the search types used for experimentation. Valid value are search_for_match_semantic, search_for_match_Hybrid_multi, search_for_match_Hybrid_cross, search_for_match_text, search_for_match_pure_vector, search_for_match_pure_vector_multi, search_for_match_pure_vector_cross, search_for_manual_hybrid. e.g. ['search_for_manual_hybrid', 'search_for_match_Hybrid_multi','search_for_match_semantic' ]",
  "retrieve_num_of_documents": "determines the number of chunks to retrieve from the search index",
- "metric_types" : "determines the metrics used for evaluation purpose. Valid value are lcsstr, lcsseq, cosine, jaro_winkler, hamming, jaccard, levenshtein, fuzzy, bert_all_MiniLM_L6_v2, bert_base_nli_mean_tokens, bert_large_nli_mean_tokens, bert_large_nli_stsb_mean_tokens, bert_distilbert_base_nli_stsb_mean_tokens, bert_paraphrase_multilingual_MiniLM_L12_v2 llm_context_precision, llm_answer_relevance. e.g ['fuzzy','bert_all_MiniLM_L6_v2','cosine','bert_distilbert_base_nli_stsb_mean_tokens']",
+ "metric_types" : "determines the metrics used for evaluation purpose. Valid value are lcsstr, lcsseq, cosine, jaro_winkler, hamming, jaccard, levenshtein, fuzzy_score, cosine_ochiai, bert_all_MiniLM_L6_v2, bert_base_nli_mean_tokens, bert_large_nli_mean_tokens, bert_large_nli_stsb_mean_tokens, bert_distilbert_base_nli_stsb_mean_tokens, bert_paraphrase_multilingual_MiniLM_L12_v2, llm_context_precision, llm_answer_relevance. e.g ['fuzzy_score','bert_all_MiniLM_L6_v2','cosine_ochiai','bert_distilbert_base_nli_stsb_mean_tokens']",
  "azure_oai_chat_deployment_name": "determines the Azure OpenAI chat deployment name",
  "azure_oai_eval_deployment_name": "determines the Azure OpenAI evaluation deployment name",