## Issue Link / Problem Description
<!-- Link to related issue or describe the problem this PR solves -->
- Fixes #[issue_number]
- OR describe the issue: What problem does this solve? How can it be
replicated?
## Changes Made
<!-- Describe what you changed and why -->
-
-
-
## Testing
<!-- Describe how this should be tested -->
### How to Test
- [ ] Automated tests added/updated
- [ ] Manual testing steps:
1.
2.
3.
## References
<!-- Link to related issues, discussions, forums, or external resources
-->
- Related issues:
- Documentation:
- External references:
## Screenshots/Examples (if applicable)
<!-- Add screenshots or code examples showing the change -->
---
<!--
Thank you for contributing to Ragas!
Please fill out the sections above as completely as possible.
The more information you provide, the faster your PR can be reviewed and
merged.
-->
**docs/howtos/applications/benchmark_llm.md** (7 additions, 7 deletions)
```diff
@@ -106,7 +106,7 @@ Example dataset structure (add an `id` column for easy comparison):
 | 2 | Arjun, aged 19, is a full-time computer-science undergraduate. His part-time job brings in about 45,000 dollars per year. He opened his account a year ago and has no premium membership. | 15 | Student only |
 | 3 | Cynthia, a 40-year-old freelance artist, earns roughly 25,000 dollars a year. She is not studying anywhere, subscribed to our basic plan five years back and never upgraded to premium. | 20 | Low income only |

-To customize the dataset for your use case, create a `datasets/` directory and add your own CSV file. Refer to [Datasets - Core Concepts](../core_concepts/datasets.md) for more information.
+To customize the dataset for your use case, create a `datasets/` directory and add your own CSV file. Refer to [Core Concepts - Evaluation Dataset](../../concepts/components/eval_dataset.md) for more information.

 It is better to sample real data from your application to create the dataset. If that is not available, you can generate synthetic data using an LLM. Since our use case is slightly complex, we recommend using a model like gpt-5-high which can generate more accurate data. Always make sure to manually review and verify the data you use.
```
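If you want a quick check that a custom CSV in `datasets/` is shaped the way the loader expects, a minimal sketch along these lines can help; the file name and column names used here are illustrative assumptions, not the guide's exact schema.

```python
# Hypothetical sanity check for a custom benchmark CSV placed in datasets/.
# The file name and column names below are assumptions for illustration;
# match them to your own schema before running the evals.
import pandas as pd

EXPECTED_COLUMNS = {"id", "description", "expected_discount", "expected_rule"}

df = pd.read_csv("datasets/discount_benchmark.csv")

missing = EXPECTED_COLUMNS - set(df.columns)
if missing:
    raise ValueError(f"CSV is missing columns: {sorted(missing)}")

# A quick look at the first rows helps catch formatting problems before evaluation.
print(df.head())
```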
```diff
@@ -134,7 +134,7 @@ The dataset loader finds your CSV file in the `datasets/` directory and loads it

 ### Metrics function

-It is generally better to use a simple metric. You should use a metric relevant to your use case. More information on metrics can be found in [Metrics - Core Concepts](../core_concepts/metrics.md). The evaluation uses this accuracy metric to score each response:
+It is generally better to use a simple metric. You should use a metric relevant to your use case. More information on metrics can be found in [Core Concepts - Metrics](../../concepts/metrics/index.md). The evaluation uses this accuracy metric to score each response:
```
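The accuracy metric itself sits outside this hunk. As a rough sketch only, a discrete metric of this kind might look roughly like the following, using the import paths this PR moves to; the decorator arguments and the `MetricResult` import path are assumptions rather than lines from the guide.

```python
# Rough sketch of a discrete accuracy metric; decorator arguments and the
# MetricResult import path are assumptions, not taken from the file under review.
from ragas.metrics.discrete import discrete_metric
from ragas.metrics.result import MetricResult


@discrete_metric(name="discount_accuracy", allowed_values=["correct", "incorrect"])
def discount_accuracy(prediction: str, expected: str) -> MetricResult:
    # Compare the model's predicted discount against the expected value from the dataset.
    ok = prediction.strip().lower() == expected.strip().lower()
    return MetricResult(
        value="correct" if ok else "incorrect",
        reason=f"predicted={prediction!r}, expected={expected!r}",
    )
```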
```diff
-We've setup the main evals code such that you can run it with different models using CLI. We’ll use these example models:
+We've setup the main evals code such that you can run it with different models using CLI. We'll use these example models:

 - Baseline: "gpt-4.1-nano-2025-04-14"
 - Candidate: "gpt-5-nano-2025-08-07"
```
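The CLI wiring is also outside this diff. As an illustrative sketch, an `evals.py` entry point could expose the model choice roughly like this; the flag name and default are assumptions, not the repository's actual code.

```python
# Illustrative sketch of CLI wiring for model selection; the flag name and
# default are assumptions, not the actual evals.py from the repository.
import argparse


def main() -> None:
    parser = argparse.ArgumentParser(description="Run the benchmark with a chosen model.")
    parser.add_argument(
        "--model",
        default="gpt-4.1-nano-2025-04-14",  # baseline model from the guide
        help="Model identifier to evaluate, e.g. gpt-5-nano-2025-08-07 for the candidate.",
    )
    args = parser.parse_args()
    print(f"Running evaluation with model: {args.model}")
    # run_evaluation(args.model) would kick off the actual experiment here.


if __name__ == "__main__":
    main()
```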
```diff
@@ -284,18 +284,18 @@ In this example run:

 ### How to read the rows
 - Skim rows where the two models disagree.
-- Use each row’s score_reason to see why it was marked correct/incorrect.
-- Look for patterns (e.g., missed rule stacking, boundary cases like “almost 65”, exact income thresholds).
+- Use each row's score_reason to see why it was marked correct/incorrect.
+- Look for patterns (e.g., missed rule stacking, boundary cases like "almost 65", exact income thresholds).

 ### Beyond accuracy
-- Check **cost** and **latency**. Higher accuracy may not be worth it if it’s too slow or too expensive for your use case.
+- Check **cost** and **latency**. Higher accuracy may not be worth it if it's too slow or too expensive for your use case.

 ### Decide
 - Switch if the new model is clearly more accurate on your important cases and fits your cost/latency needs.
 - Stay if gains are small, failures hit critical cases, or cost/latency are not acceptable.

 In this example:
-- We would switch to "gpt-5-nano-2025-08-07". It improves accuracy from 50% to 90% (+40%) and fixes the key failure modes (missed rule stacking, boundary conditions). If its latency/cost fits your constraints, it’s the better default.
+- We would switch to "gpt-5-nano-2025-08-07". It improves accuracy from 50% to 90% (+40%) and fixes the key failure modes (missed rule stacking, boundary conditions). If its latency/cost fits your constraints, it's the better default.
```
**docs/howtos/customizations/iterate_prompt.md** (7 additions, 7 deletions)
```diff
@@ -29,7 +29,7 @@ We've created a synthetic dataset for our use case. Each row has `id, text, labe
 | 2 | SSO via Okta succeeds then bounces back to /login; colleagues can sign in; state mismatch; blocked from boards. | Account;ProductIssue | P0 |
 | 3 | Need to export a board to PDF with comments and page numbers for audit; deadline next week. | HowTo | P2 |

-To customize the dataset for your use case, create a `datasets/` directory and add your own CSV file. You can also connect to different backends. Refer to [Datasets - Core Concepts](../core_concepts/datasets.md) for more information.
+To customize the dataset for your use case, create a `datasets/` directory and add your own CSV file. You can also connect to different backends. Refer to [Core Concepts - Evaluation Dataset](../../concepts/components/eval_dataset.md) for more information.

 It is better to sample real data from your application to create the dataset. If that is not available, you can generate synthetic data using an LLM. We recommend using a reasoning model like gpt-5 high-reasoning which can generate more accurate and complex data. Always make sure to manually review and verify the data you use.
```
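For a concrete sense of the dataset shape, a tiny script like the one below could write a starter CSV into `datasets/` using the two example rows above; the column names (`id`, `text`, `labels`, `priority`) are assumed from the guide's description of a row.

```python
# Illustrative only: writes a tiny starter CSV into datasets/. The column names
# (id, text, labels, priority) are assumed from the guide's description of each row.
import csv
from pathlib import Path

rows = [
    {
        "id": 2,
        "text": "SSO via Okta succeeds then bounces back to /login; colleagues can sign in; state mismatch; blocked from boards.",
        "labels": "Account;ProductIssue",
        "priority": "P0",
    },
    {
        "id": 3,
        "text": "Need to export a board to PDF with comments and page numbers for audit; deadline next week.",
        "labels": "HowTo",
        "priority": "P2",
    },
]

Path("datasets").mkdir(exist_ok=True)
with open("datasets/support_triage.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "text", "labels", "priority"])
    writer.writeheader()
    writer.writerows(rows)
```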
````diff
@@ -83,16 +83,16 @@ This will run the prompt on sample case and print the results.

 ### Metrics for scoring

-It is generally better to use a simpler metric instead of a complex one. You should use a metric relevant to your use case. More information on metrics can be found in [Metrics - Core Concepts](../core_concepts/metrics.md). Here we use two discrete metrics: `labels_exact_match` and `priority_accuracy`. Keeping them separate helps analyze and fix different failure modes.
+It is generally better to use a simpler metric instead of a complex one. You should use a metric relevant to your use case. More information on metrics can be found in [Core Concepts - Metrics](../../concepts/metrics/index.md). Here we use two discrete metrics: `labels_exact_match` and `priority_accuracy`. Keeping them separate helps analyze and fix different failure modes.

 - `priority_accuracy`: Checks whether the predicted priority matches the expected priority; important for correct urgency triage.
 - `labels_exact_match`: Checks whether the set of predicted labels exactly matches the expected labels; important to avoid over/under-tagging and helps us measure the accuracy of our system in labeling the cases.

 ```python
 # examples/iterate_prompt/evals.py
 import json
-from ragas.experimental.metrics.discrete import discrete_metric
-from ragas.experimental.metrics.result import MetricResult
+from ragas.metrics.discrete import discrete_metric
````
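The metric definitions that follow these imports are not shown in the hunk. As a sketch only, `labels_exact_match` could be written against the new import path along these lines; the decorator arguments and the semicolon-separated label format are assumptions.

```python
# Sketch of a set-based label match using the updated import path; the decorator
# arguments and the semicolon-separated label format are assumptions.
from ragas.metrics.discrete import discrete_metric
from ragas.metrics.result import MetricResult


@discrete_metric(name="labels_exact_match", allowed_values=["pass", "fail"])
def labels_exact_match(predicted_labels: str, expected_labels: str) -> MetricResult:
    # Compare label sets so ordering does not matter, e.g. "Account;ProductIssue".
    predicted = {label.strip() for label in predicted_labels.split(";") if label.strip()}
    expected = {label.strip() for label in expected_labels.split(";") if label.strip()}
    ok = predicted == expected
    return MetricResult(
        value="pass" if ok else "fail",
        reason=f"predicted={sorted(predicted)}, expected={sorted(expected)}",
    )
```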
```diff
-The experiment function is used to run the prompt on a dataset. More information on Experiment can be found in [Experimentation - Core Concepts](../core_concepts/experimentation.md).
+The experiment function is used to run the prompt on a dataset. More information on experimentation can be found in [Core Concepts - Experimentation](../../experimental/core_concepts/experimentation.md).

 Notice that we are passing `prompt_file` as a parameter so that we can run experiments with different prompts. You can also pass other parameters to the experiment function like model, temperature, etc. and experiment with different configurations. It is recommended to change only 1 parameter at a time while doing experimentation.
```
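The experiment function body is not part of this hunk. Conceptually it loads the prompt from `prompt_file`, runs it on each row, and collects results; the plain-Python sketch below illustrates that shape with assumed names and without the actual Ragas experiment API.

```python
# Plain-Python sketch of the experiment shape described above; function and field
# names are assumptions, and the real version would use Ragas' experiment API.
from pathlib import Path


def call_llm(model: str, prompt: str) -> str:
    """Placeholder for the project's real LLM client call."""
    return "..."


def run_prompt_experiment(rows: list[dict], prompt_file: str, model: str) -> list[dict]:
    # Assumes the prompt file contains a {text} placeholder for the ticket text.
    prompt_template = Path(prompt_file).read_text()
    results = []
    for row in rows:
        prompt = prompt_template.format(text=row["text"])
        response = call_llm(model=model, prompt=prompt)
        results.append({"id": row["id"], "response": response})
    return results
```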
````diff
-The dataset loader is used to load the dataset into a Ragas dataset object. More information on Dataset can be found in [Datasets - Core Concepts](../core_concepts/datasets.md).
+The dataset loader is used to load the dataset into a Ragas dataset object. More information on datasets can be found in [Core Concepts - Evaluation Dataset](../../concepts/components/eval_dataset.md).

 ```python
 # examples/iterate_prompt/evals.py
````
```diff
@@ -350,4 +350,4 @@ Stop iterating when improvements plateau or accuracy meets business requirements

 Once you have your dataset and evaluation loop setup, you can expand this to testing more parameters like model, etc.

-The Ragas framework handles the orchestration, parallel execution, and result aggregation automatically for you, helping you evaluate and focus on your use case!
+The Ragas framework handles the orchestration, parallel execution, and result aggregation automatically for you, helping you evaluate and focus on your use case!
```