Commit 67a39be

Move howtos (#2236)
1 parent 68eebea commit 67a39be

File tree

4 files changed: +17 −18 lines changed


docs/experimental/howtos/index.md

Lines changed: 1 addition & 2 deletions

````diff
@@ -2,5 +2,4 @@
 
 Step-by-step guides for specific tasks and goals using Ragas experimental features.
 
-- [How to Evaluate Your Prompt and Improve It](iterate_prompt.md)
-- [How to Evaluate a New LLM For Your Use Case](benchmark_llm.md)
+*Note: The how-to guides from this section have been moved to the main [How-to Guides](../../howtos/index.md) section.*
````

docs/experimental/howtos/benchmark_llm.md renamed to docs/howtos/applications/benchmark_llm.md

Lines changed: 7 additions & 7 deletions

````diff
@@ -106,7 +106,7 @@ Example dataset structure (add an `id` column for easy comparison):
 | 2 | Arjun, aged 19, is a full-time computer-science undergraduate. His part-time job brings in about 45,000 dollars per year. He opened his account a year ago and has no premium membership. | 15 | Student only |
 | 3 | Cynthia, a 40-year-old freelance artist, earns roughly 25,000 dollars a year. She is not studying anywhere, subscribed to our basic plan five years back and never upgraded to premium. | 20 | Low income only |
 
-To customize the dataset for your use case, create a `datasets/` directory and add your own CSV file. Refer to [Datasets - Core Concepts](../core_concepts/datasets.md) for more information.
+To customize the dataset for your use case, create a `datasets/` directory and add your own CSV file. Refer to [Core Concepts - Evaluation Dataset](../../concepts/components/eval_dataset.md) for more information.
 
 It is better to sample real data from your application to create the dataset. If that is not available, you can generate synthetic data using an LLM. Since our use case is slightly complex, we recommend using a model like gpt-5-high which can generate more accurate data. Always make sure to manually review and verify the data you use.
 
````
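
As an illustration of the `datasets/` layout this guide describes, the sketch below writes a toy CSV into that directory. The column names (`id`, `description`, `expected_discount`, `reason`) and the filename are hypothetical placeholders, not the guide's actual schema.

```python
# Illustrative only: the column names and filename are hypothetical placeholders,
# not the schema used by the guide's actual dataset.
import csv
from pathlib import Path

rows = [
    {"id": 2, "description": "Arjun, 19, full-time CS undergraduate, ~45,000 dollars/year, no premium.",
     "expected_discount": 15, "reason": "Student only"},
    {"id": 3, "description": "Cynthia, 40, freelance artist, ~25,000 dollars/year, basic plan, no premium.",
     "expected_discount": 20, "reason": "Low income only"},
]

datasets_dir = Path("datasets")
datasets_dir.mkdir(exist_ok=True)

with open(datasets_dir / "discount_benchmark.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "description", "expected_discount", "reason"])
    writer.writeheader()
    writer.writerows(rows)
```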

````diff
@@ -134,7 +134,7 @@ The dataset loader finds your CSV file in the `datasets/` directory and loads it
 
 ### Metrics function
 
-It is generally better to use a simple metric. You should use a metric relevant to your use case. More information on metrics can be found in [Metrics - Core Concepts](../core_concepts/metrics.md). The evaluation uses this accuracy metric to score each response:
+It is generally better to use a simple metric. You should use a metric relevant to your use case. More information on metrics can be found in [Core Concepts - Metrics](../../concepts/metrics/index.md). The evaluation uses this accuracy metric to score each response:
 
 ```python
 @discrete_metric(name="discount_accuracy", allowed_values=["correct", "incorrect"])
````
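
The hunk cuts off the `discount_accuracy` snippet at the decorator. For orientation, here is a minimal sketch of how a discrete metric of this shape could be completed; the parameter names, comparison logic, and the `MetricResult(value=..., reason=...)` call are assumptions, not the file's actual implementation.

```python
# Sketch only: parameter names, comparison logic, and the MetricResult keyword
# arguments are assumptions, not the code in benchmark_llm.md.
from ragas.metrics.discrete import discrete_metric
from ragas.metrics.result import MetricResult


@discrete_metric(name="discount_accuracy", allowed_values=["correct", "incorrect"])
def discount_accuracy(prediction: str, expected_discount: str):
    # Count the response as correct if it states the expected discount value.
    value = "correct" if str(expected_discount) in str(prediction) else "incorrect"
    return MetricResult(
        value=value,
        reason=f"Expected discount {expected_discount}; model said: {prediction!r}",
    )
```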

````diff
@@ -196,7 +196,7 @@ return benchmark_experiment
 
 ## Run experiments
 
-We've setup the main evals code such that you can run it with different models using CLI. Well use these example models:
+We've setup the main evals code such that you can run it with different models using CLI. We'll use these example models:
 
 - Baseline: "gpt-4.1-nano-2025-04-14"
 - Candidate: "gpt-5-nano-2025-08-07"
````
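
The hunk above does not show how the evals script reads the model name from the command line, so the argparse wrapper below is only a hypothetical sketch of that pattern; the guide's real evals.py may expose different flags.

```python
# Hypothetical sketch of a CLI entry point that swaps models between runs;
# the flags and function below are illustrative, not the guide's actual evals.py.
import argparse
import asyncio


async def run_benchmark(model: str) -> None:
    # Placeholder for the guide's experiment loop (load dataset, query the model, score).
    print(f"Running benchmark experiment with model={model}")


def main() -> None:
    parser = argparse.ArgumentParser(description="Run the discount benchmark against one model.")
    parser.add_argument(
        "--model",
        default="gpt-4.1-nano-2025-04-14",
        help="Model name, e.g. the baseline or candidate listed above.",
    )
    args = parser.parse_args()
    asyncio.run(run_benchmark(args.model))


if __name__ == "__main__":
    main()
```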

````diff
@@ -284,18 +284,18 @@ In this example run:
 
 ### How to read the rows
 - Skim rows where the two models disagree.
-- Use each rows score_reason to see why it was marked correct/incorrect.
-- Look for patterns (e.g., missed rule stacking, boundary cases like almost 65, exact income thresholds).
+- Use each row's score_reason to see why it was marked correct/incorrect.
+- Look for patterns (e.g., missed rule stacking, boundary cases like "almost 65", exact income thresholds).
 
 ### Beyond accuracy
-- Check **cost** and **latency**. Higher accuracy may not be worth it if its too slow or too expensive for your use case.
+- Check **cost** and **latency**. Higher accuracy may not be worth it if it's too slow or too expensive for your use case.
 
 ### Decide
 - Switch if the new model is clearly more accurate on your important cases and fits your cost/latency needs.
 - Stay if gains are small, failures hit critical cases, or cost/latency are not acceptable.
 
 In this example:
-- We would switch to "gpt-5-nano-2025-08-07". It improves accuracy from 50% to 90% (+40%) and fixes the key failure modes (missed rule stacking, boundary conditions). If its latency/cost fits your constraints, its the better default.
+- We would switch to "gpt-5-nano-2025-08-07". It improves accuracy from 50% to 90% (+40%) and fixes the key failure modes (missed rule stacking, boundary conditions). If its latency/cost fits your constraints, it's the better default.
 
 ## Adapting to your use case
 
````

docs/experimental/howtos/iterate_prompt.md renamed to docs/howtos/customizations/iterate_prompt.md

Lines changed: 7 additions & 7 deletions

````diff
@@ -29,7 +29,7 @@ We've created a synthetic dataset for our use case. Each row has `id, text, labe
 | 2 | SSO via Okta succeeds then bounces back to /login; colleagues can sign in; state mismatch; blocked from boards. | Account;ProductIssue | P0 |
 | 3 | Need to export a board to PDF with comments and page numbers for audit; deadline next week. | HowTo | P2 |
 
-To customize the dataset for your use case, create a `datasets/` directory and add your own CSV file. You can also connect to different backends. Refer to [Datasets - Core Concepts](../core_concepts/datasets.md) for more information.
+To customize the dataset for your use case, create a `datasets/` directory and add your own CSV file. You can also connect to different backends. Refer to [Core Concepts - Evaluation Dataset](../../concepts/components/eval_dataset.md) for more information.
 
 It is better to sample real data from your application to create the dataset. If that is not available, you can generate synthetic data using an LLM. We recommend using a reasoning model like gpt-5 high-reasoning which can generate more accurate and complex data. Always make sure to manually review and verify the data you use.
 
````

````diff
@@ -83,16 +83,16 @@ This will run the prompt on sample case and print the results.
 
 ### Metrics for scoring
 
-It is generally better to use a simpler metric instead of a complex one. You should use a metric relevant to your use case. More information on metrics can be found in [Metrics - Core Concepts](../core_concepts/metrics.md). Here we use two discrete metrics: `labels_exact_match` and `priority_accuracy`. Keeping them separate helps analyze and fix different failure modes.
+It is generally better to use a simpler metric instead of a complex one. You should use a metric relevant to your use case. More information on metrics can be found in [Core Concepts - Metrics](../../concepts/metrics/index.md). Here we use two discrete metrics: `labels_exact_match` and `priority_accuracy`. Keeping them separate helps analyze and fix different failure modes.
 
 - `priority_accuracy`: Checks whether the predicted priority matches the expected priority; important for correct urgency triage.
 - `labels_exact_match`: Checks whether the set of predicted labels exactly matches the expected labels; important to avoid over/under-tagging and helps us measure the accuracy of our system in labeling the cases.
 
 ```python
 # examples/iterate_prompt/evals.py
 import json
-from ragas.experimental.metrics.discrete import discrete_metric
-from ragas.experimental.metrics.result import MetricResult
+from ragas.metrics.discrete import discrete_metric
+from ragas.metrics.result import MetricResult
 
 @discrete_metric(name="labels_exact_match", allowed_values=["correct", "incorrect"])
 def labels_exact_match(prediction: str, expected_labels: str):
````
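
The diff truncates both metric bodies. The sketch below shows how such discrete metrics could be filled in; the JSON parsing, the ";"-separated label format, and the `MetricResult(value=..., reason=...)` call are assumptions for illustration, not the actual code in iterate_prompt.md.

```python
# Sketch only: the JSON parsing and comparison logic are assumptions about how
# the guide's metrics work, not the actual bodies in iterate_prompt.md.
import json

from ragas.metrics.discrete import discrete_metric
from ragas.metrics.result import MetricResult


@discrete_metric(name="labels_exact_match", allowed_values=["correct", "incorrect"])
def labels_exact_match(prediction: str, expected_labels: str):
    # Assume the model returns JSON like {"labels": [...], "priority": "P0"} and
    # expected_labels is a ";"-separated string such as "Account;ProductIssue".
    try:
        predicted = set(json.loads(prediction).get("labels", []))
    except (json.JSONDecodeError, AttributeError):
        return MetricResult(value="incorrect", reason="Prediction was not valid JSON.")
    expected = {label.strip() for label in expected_labels.split(";") if label.strip()}
    value = "correct" if predicted == expected else "incorrect"
    return MetricResult(value=value, reason=f"predicted={sorted(predicted)}, expected={sorted(expected)}")


@discrete_metric(name="priority_accuracy", allowed_values=["correct", "incorrect"])
def priority_accuracy(prediction: str, expected_priority: str):
    try:
        predicted_priority = json.loads(prediction).get("priority", "")
    except (json.JSONDecodeError, AttributeError):
        return MetricResult(value="incorrect", reason="Prediction was not valid JSON.")
    value = "correct" if predicted_priority == expected_priority else "incorrect"
    return MetricResult(value=value, reason=f"predicted={predicted_priority!r}, expected={expected_priority!r}")
```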

````diff
@@ -120,7 +120,7 @@ def priority_accuracy(prediction: str, expected_priority: str):
 
 ### The experiment function
 
-The experiment function is used to run the prompt on a dataset. More information on Experiment can be found in [Experimentation - Core Concepts](../core_concepts/experimentation.md).
+The experiment function is used to run the prompt on a dataset. More information on experimentation can be found in [Core Concepts - Experimentation](../../experimental/core_concepts/experimentation.md).
 
 Notice that we are passing `prompt_file` as a parameter so that we can run experiments with different prompts. You can also pass other parameters to the experiment function like model, temperature, etc. and experiment with different configurations. It is recommended to change only 1 parameter at a time while doing experimentation.
 
````
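
The guide's real experiment function (`support_triage_experiment`, visible only as a signature in the next hunk) is wrapped by Ragas. As a framework-free illustration of the `prompt_file` parameterization described above, here is a hedged sketch using the OpenAI async client; the model name, prompt path, and message layout are assumptions.

```python
# Framework-free sketch of parameterizing an experiment by prompt_file; the
# guide's actual function is decorated by Ragas and structured differently.
# Model name, prompt path, and message layout below are assumptions.
import asyncio
from pathlib import Path

from openai import AsyncOpenAI

client = AsyncOpenAI()


async def run_one_case(text: str, prompt_file: str, model: str = "gpt-4o-mini") -> str:
    # Read the system prompt from disk so different prompt files can be compared run-to-run.
    system_prompt = Path(prompt_file).read_text()
    response = await client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    case = "SSO via Okta succeeds then bounces back to /login; colleagues can sign in."
    print(asyncio.run(run_one_case(case, prompt_file="prompts/v1.txt")))
```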

````diff
@@ -156,7 +156,7 @@ async def support_triage_experiment(row, prompt_file: str, experiment_name: str)
 
 ### Dataset loader (CSV)
 
-The dataset loader is used to load the dataset into a Ragas dataset object. More information on Dataset can be found in [Datasets - Core Concepts](../core_concepts/datasets.md).
+The dataset loader is used to load the dataset into a Ragas dataset object. More information on datasets can be found in [Core Concepts - Evaluation Dataset](../../concepts/components/eval_dataset.md).
 
 ```python
 # examples/iterate_prompt/evals.py
````
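
As a rough illustration of the loading step, the plain-Python sketch below reads the CSV into a list of dicts; the filename is a hypothetical placeholder, and the guide's actual loader wraps the rows in a Ragas dataset object rather than returning them directly.

```python
# Plain-Python sketch of the CSV loading step; the filename is a placeholder and
# the guide's real loader returns a Ragas dataset object, not a list of dicts.
import csv
from pathlib import Path


def load_rows(csv_path: str = "datasets/support_triage.csv") -> list[dict]:
    with open(Path(csv_path), newline="") as f:
        return list(csv.DictReader(f))


if __name__ == "__main__":
    rows = load_rows()
    print(f"Loaded {len(rows)} rows; columns: {list(rows[0].keys()) if rows else []}")
```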

````diff
@@ -350,4 +350,4 @@ Stop iterating when improvements plateau or accuracy meets business requirements
 
 Once you have your dataset and evaluation loop setup, you can expand this to testing more parameters like model, etc.
 
-The Ragas framework handles the orchestration, parallel execution, and result aggregation automatically for you, helping you evaluate and focus on your use case!
+The Ragas framework handles the orchestration, parallel execution, and result aggregation automatically for you, helping you evaluate and focus on your use case!
````

mkdocs.yml

Lines changed: 2 additions & 2 deletions

````diff
@@ -91,8 +91,6 @@ nav:
       - Experimentation: experimental/core_concepts/experimentation.md
     - How-to Guides:
       - experimental/howtos/index.md
-      - Improve a Prompt: experimental/howtos/iterate_prompt.md
-      - Evaluate a New LLM: experimental/howtos/benchmark_llm.md
   - 🛠️ How-to Guides:
     - howtos/index.md
     - Customizations:
@@ -101,6 +99,7 @@ nav:
       - Customise models: howtos/customizations/customize_models.md
       - Run Config: howtos/customizations/_run_config.md
       - Caching: howtos/customizations/_caching.md
+      - Iterate and Improve Prompts: howtos/customizations/iterate_prompt.md
       - Metrics:
         - Modify Prompts: howtos/customizations/metrics/_modifying-prompts-metrics.md
         - Adapt Metrics to Languages: howtos/customizations/metrics/_metrics_language_adaptation.md
@@ -122,6 +121,7 @@ nav:
         - Single-hop Query Testset: howtos/applications/singlehop_testset_gen.md
       - Benchmarking:
         - Benchmarking Gemini models: howtos/applications/gemini_benchmarking.md
+        - Evaluate a New LLM: howtos/applications/benchmark_llm.md
     - Integrations:
       - howtos/integrations/index.md
       - Arize: howtos/integrations/_arize.md
````
