## Issue Link / Problem Description
<!-- Link to related issue or describe the problem this PR solves -->
- Fixes #[issue_number]
- OR describe the issue: What problem does this solve? How can it be
replicated?
## Changes Made
<!-- Describe what you changed and why -->
-
-
-
## Testing
<!-- Describe how this should be tested -->
### How to Test
- [ ] Automated tests added/updated
- [ ] Manual testing steps:
1.
2.
3.
## References
<!-- Link to related issues, discussions, forums, or external resources
-->
- Related issues:
- Documentation:
- External references:
## Screenshots/Examples (if applicable)
<!-- Add screenshots or code examples showing the change -->
---
<!--
Thank you for contributing to Ragas!
Please fill out the sections above as completely as possible.
The more information you provide, the faster your PR can be reviewed and
merged.
-->
**docs/howtos/applications/benchmark_llm.md** (7 additions, 7 deletions)
```diff
@@ -106,7 +106,7 @@ Example dataset structure (add an `id` column for easy comparison):
 | 2 | Arjun, aged 19, is a full-time computer-science undergraduate. His part-time job brings in about 45,000 dollars per year. He opened his account a year ago and has no premium membership. | 15 | Student only |
 | 3 | Cynthia, a 40-year-old freelance artist, earns roughly 25,000 dollars a year. She is not studying anywhere, subscribed to our basic plan five years back and never upgraded to premium. | 20 | Low income only |

-To customize the dataset for your use case, create a `datasets/` directory and add your own CSV file. Refer to [Datasets - Core Concepts](../core_concepts/datasets.md) for more information.
+To customize the dataset for your use case, create a `datasets/` directory and add your own CSV file. Refer to [Core Concepts - Evaluation Dataset](../../concepts/components/eval_dataset.md) for more information.

 It is better to sample real data from your application to create the dataset. If that is not available, you can generate synthetic data using an LLM. Since our use case is slightly complex, we recommend using a model like gpt-5-high which can generate more accurate data. Always make sure to manually review and verify the data you use.
```
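If you want a quick check that a custom CSV in `datasets/` is shaped the way the loader expects, a minimal sketch along these lines can help; the file name and column names used here are illustrative assumptions, not the guide's exact schema.

```python
# Hypothetical sanity check for a custom benchmark CSV placed in datasets/.
# The file name and column names below are assumptions for illustration;
# match them to your own schema before running the evals.
import pandas as pd

EXPECTED_COLUMNS = {"id", "description", "expected_discount", "expected_rule"}

df = pd.read_csv("datasets/discount_benchmark.csv")

missing = EXPECTED_COLUMNS - set(df.columns)
if missing:
    raise ValueError(f"CSV is missing columns: {sorted(missing)}")

# A quick look at the first rows helps catch formatting problems before evaluation.
print(df.head())
```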
```diff
@@ -134,7 +134,7 @@ The dataset loader finds your CSV file in the `datasets/` directory and loads it

 ### Metrics function

-It is generally better to use a simple metric. You should use a metric relevant to your use case. More information on metrics can be found in [Metrics - Core Concepts](../core_concepts/metrics.md). The evaluation uses this accuracy metric to score each response:
+It is generally better to use a simple metric. You should use a metric relevant to your use case. More information on metrics can be found in [Core Concepts - Metrics](../../concepts/metrics/index.md). The evaluation uses this accuracy metric to score each response:
```
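The accuracy metric itself sits outside this hunk. As a rough sketch only, a discrete metric of this kind might look roughly like the following, using the import paths this PR moves to; the decorator arguments and the `MetricResult` import path are assumptions rather than lines from the guide.

```python
# Rough sketch of a discrete accuracy metric; decorator arguments and the
# MetricResult import path are assumptions, not taken from the file under review.
from ragas.metrics.discrete import discrete_metric
from ragas.metrics.result import MetricResult


@discrete_metric(name="discount_accuracy", allowed_values=["correct", "incorrect"])
def discount_accuracy(prediction: str, expected: str) -> MetricResult:
    # Compare the model's predicted discount against the expected value from the dataset.
    ok = prediction.strip().lower() == expected.strip().lower()
    return MetricResult(
        value="correct" if ok else "incorrect",
        reason=f"predicted={prediction!r}, expected={expected!r}",
    )
```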
```diff
-We've setup the main evals code such that you can run it with different models using CLI. We’ll use these example models:
+We've setup the main evals code such that you can run it with different models using CLI. We'll use these example models:

 - Baseline: "gpt-4.1-nano-2025-04-14"
 - Candidate: "gpt-5-nano-2025-08-07"
```
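The CLI wiring is also outside this diff. As an illustrative sketch, an `evals.py` entry point could expose the model choice roughly like this; the flag name and default are assumptions, not the repository's actual code.

```python
# Illustrative sketch of CLI wiring for model selection; the flag name and
# default are assumptions, not the actual evals.py from the repository.
import argparse


def main() -> None:
    parser = argparse.ArgumentParser(description="Run the benchmark with a chosen model.")
    parser.add_argument(
        "--model",
        default="gpt-4.1-nano-2025-04-14",  # baseline model from the guide
        help="Model identifier to evaluate, e.g. gpt-5-nano-2025-08-07 for the candidate.",
    )
    args = parser.parse_args()
    print(f"Running evaluation with model: {args.model}")
    # run_evaluation(args.model) would kick off the actual experiment here.


if __name__ == "__main__":
    main()
```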
```diff
@@ -284,18 +284,18 @@ In this example run:

 ### How to read the rows
 - Skim rows where the two models disagree.
-- Use each row’s score_reason to see why it was marked correct/incorrect.
-- Look for patterns (e.g., missed rule stacking, boundary cases like “almost 65”, exact income thresholds).
+- Use each row's score_reason to see why it was marked correct/incorrect.
+- Look for patterns (e.g., missed rule stacking, boundary cases like "almost 65", exact income thresholds).

 ### Beyond accuracy
-- Check **cost** and **latency**. Higher accuracy may not be worth it if it’s too slow or too expensive for your use case.
+- Check **cost** and **latency**. Higher accuracy may not be worth it if it's too slow or too expensive for your use case.

 ### Decide
 - Switch if the new model is clearly more accurate on your important cases and fits your cost/latency needs.
 - Stay if gains are small, failures hit critical cases, or cost/latency are not acceptable.

 In this example:
-- We would switch to "gpt-5-nano-2025-08-07". It improves accuracy from 50% to 90% (+40%) and fixes the key failure modes (missed rule stacking, boundary conditions). If its latency/cost fits your constraints, it’s the better default.
+- We would switch to "gpt-5-nano-2025-08-07". It improves accuracy from 50% to 90% (+40%) and fixes the key failure modes (missed rule stacking, boundary conditions). If its latency/cost fits your constraints, it's the better default.
```
**docs/howtos/customizations/iterate_prompt.md** (7 additions, 7 deletions)
```diff
@@ -29,7 +29,7 @@ We've created a synthetic dataset for our use case. Each row has `id, text, labe
 | 2 | SSO via Okta succeeds then bounces back to /login; colleagues can sign in; state mismatch; blocked from boards. | Account;ProductIssue | P0 |
 | 3 | Need to export a board to PDF with comments and page numbers for audit; deadline next week. | HowTo | P2 |

-To customize the dataset for your use case, create a `datasets/` directory and add your own CSV file. You can also connect to different backends. Refer to [Datasets - Core Concepts](../core_concepts/datasets.md) for more information.
+To customize the dataset for your use case, create a `datasets/` directory and add your own CSV file. You can also connect to different backends. Refer to [Core Concepts - Evaluation Dataset](../../concepts/components/eval_dataset.md) for more information.

 It is better to sample real data from your application to create the dataset. If that is not available, you can generate synthetic data using an LLM. We recommend using a reasoning model like gpt-5 high-reasoning which can generate more accurate and complex data. Always make sure to manually review and verify the data you use.
```
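For a concrete sense of the dataset shape, a tiny script like the one below could write a starter CSV into `datasets/` using the two example rows above; the column names (`id`, `text`, `labels`, `priority`) are assumed from the guide's description of a row.

```python
# Illustrative only: writes a tiny starter CSV into datasets/. The column names
# (id, text, labels, priority) are assumed from the guide's description of each row.
import csv
from pathlib import Path

rows = [
    {
        "id": 2,
        "text": "SSO via Okta succeeds then bounces back to /login; colleagues can sign in; state mismatch; blocked from boards.",
        "labels": "Account;ProductIssue",
        "priority": "P0",
    },
    {
        "id": 3,
        "text": "Need to export a board to PDF with comments and page numbers for audit; deadline next week.",
        "labels": "HowTo",
        "priority": "P2",
    },
]

Path("datasets").mkdir(exist_ok=True)
with open("datasets/support_triage.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "text", "labels", "priority"])
    writer.writeheader()
    writer.writerows(rows)
```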
````diff
@@ -83,16 +83,16 @@ This will run the prompt on sample case and print the results.

 ### Metrics for scoring

-It is generally better to use a simpler metric instead of a complex one. You should use a metric relevant to your use case. More information on metrics can be found in [Metrics - Core Concepts](../core_concepts/metrics.md). Here we use two discrete metrics: `labels_exact_match` and `priority_accuracy`. Keeping them separate helps analyze and fix different failure modes.
+It is generally better to use a simpler metric instead of a complex one. You should use a metric relevant to your use case. More information on metrics can be found in [Core Concepts - Metrics](../../concepts/metrics/index.md). Here we use two discrete metrics: `labels_exact_match` and `priority_accuracy`. Keeping them separate helps analyze and fix different failure modes.

 - `priority_accuracy`: Checks whether the predicted priority matches the expected priority; important for correct urgency triage.
 - `labels_exact_match`: Checks whether the set of predicted labels exactly matches the expected labels; important to avoid over/under-tagging and helps us measure the accuracy of our system in labeling the cases.

 ```python
 # examples/iterate_prompt/evals.py
 import json
-from ragas.experimental.metrics.discrete import discrete_metric
-from ragas.experimental.metrics.result import MetricResult
+from ragas.metrics.discrete import discrete_metric
````
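The metric definitions that follow these imports are not shown in the hunk. As a sketch only, `labels_exact_match` could be written against the new import path along these lines; the decorator arguments and the semicolon-separated label format are assumptions.

```python
# Sketch of a set-based label match using the updated import path; the decorator
# arguments and the semicolon-separated label format are assumptions.
from ragas.metrics.discrete import discrete_metric
from ragas.metrics.result import MetricResult


@discrete_metric(name="labels_exact_match", allowed_values=["pass", "fail"])
def labels_exact_match(predicted_labels: str, expected_labels: str) -> MetricResult:
    # Compare label sets so ordering does not matter, e.g. "Account;ProductIssue".
    predicted = {label.strip() for label in predicted_labels.split(";") if label.strip()}
    expected = {label.strip() for label in expected_labels.split(";") if label.strip()}
    ok = predicted == expected
    return MetricResult(
        value="pass" if ok else "fail",
        reason=f"predicted={sorted(predicted)}, expected={sorted(expected)}",
    )
```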
```diff
-The experiment function is used to run the prompt on a dataset. More information on Experiment can be found in [Experimentation - Core Concepts](../core_concepts/experimentation.md).
+The experiment function is used to run the prompt on a dataset. More information on experimentation can be found in [Core Concepts - Experimentation](../../experimental/core_concepts/experimentation.md).

 Notice that we are passing `prompt_file` as a parameter so that we can run experiments with different prompts. You can also pass other parameters to the experiment function like model, temperature, etc. and experiment with different configurations. It is recommended to change only 1 parameter at a time while doing experimentation.
```
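The experiment function body is not part of this hunk. Conceptually it loads the prompt from `prompt_file`, runs it on each row, and collects results; the plain-Python sketch below illustrates that shape with assumed names and without the actual Ragas experiment API.

```python
# Plain-Python sketch of the experiment shape described above; function and field
# names are assumptions, and the real version would use Ragas' experiment API.
from pathlib import Path


def call_llm(model: str, prompt: str) -> str:
    """Placeholder for the project's real LLM client call."""
    return "..."


def run_prompt_experiment(rows: list[dict], prompt_file: str, model: str) -> list[dict]:
    # Assumes the prompt file contains a {text} placeholder for the ticket text.
    prompt_template = Path(prompt_file).read_text()
    results = []
    for row in rows:
        prompt = prompt_template.format(text=row["text"])
        response = call_llm(model=model, prompt=prompt)
        results.append({"id": row["id"], "response": response})
    return results
```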
````diff
-The dataset loader is used to load the dataset into a Ragas dataset object. More information on Dataset can be found in [Datasets - Core Concepts](../core_concepts/datasets.md).
+The dataset loader is used to load the dataset into a Ragas dataset object. More information on datasets can be found in [Core Concepts - Evaluation Dataset](../../concepts/components/eval_dataset.md).

 ```python
 # examples/iterate_prompt/evals.py
````
```diff
@@ -350,4 +350,4 @@ Stop iterating when improvements plateau or accuracy meets business requirements

 Once you have your dataset and evaluation loop setup, you can expand this to testing more parameters like model, etc.

-The Ragas framework handles the orchestration, parallel execution, and result aggregation automatically for you, helping you evaluate and focus on your use case!
+The Ragas framework handles the orchestration, parallel execution, and result aggregation automatically for you, helping you evaluate and focus on your use case!
```