
Commit 911a127

Simplify earlier how to guides in docs (#2319)
PR Description:

## Issue Link / Problem Description

- Addresses the complexity barrier in Ragas examples that was making it difficult for beginners to understand and use evaluation workflows
- Removes overly complex async patterns, abstractions, and infrastructure code from examples that obscured the core evaluation concepts

## Changes Made

- **Text2SQL Examples Simplification**: Streamlined all text2sql evaluation components by removing unnecessary async patterns, timing infrastructure, and complex abstractions
- **Database and Data Utils Cleanup**: Simplified `db_utils.py` and `data_utils.py` to focus on core functionality while removing batch processing and concurrency complexity
- **Agent Evaluation Streamlining**: Simplified agent evaluation examples by removing indirection layers and factory patterns
- **Benchmark LLM Simplification**: Converted async patterns to a simpler synchronous approach and removed unnecessary abstractions
- **Improve RAG Examples**: Streamlined evaluation code by removing indirection layers and complex patterns
- **Documentation Updates**: Updated text2sql and benchmark_llm documentation to reflect the simplified examples and remove obsolete parameters
- **Core Library Improvements**: Minor fixes to validation, evaluation, and utility modules for better code quality
- Significantly reduced line count: 2,105 deletions vs 818 additions

## Testing

### How to Test

- [ ] Automated tests added/updated
- [ ] Manual testing steps:
    1. Run the simplified text2sql evaluation examples to ensure functionality is preserved
    2. Verify benchmark_llm examples work with the simplified codebase
    3. Test improve_rag examples to confirm streamlined evaluation flows
    4. Check that documentation accurately reflects the simplified examples
    5. Ensure core ragas evaluation functionality remains intact after utility changes
1 parent d4ff601 commit 911a127

File tree

17 files changed: +934, -2637 lines changed


docs/howtos/applications/benchmark_llm.md

Lines changed: 114 additions & 87 deletions
@@ -21,66 +21,71 @@ We'll use discount calculation as our test case: given a customer profile, calcu
 
 ## Set up your environment and API access
 
-First, ensure you have your API credentials configured:
+First, install the ragas-examples package which contains the benchmark LLM example code:
 
 ```bash
-export OPENAI_API_KEY=your_actual_api_key
+pip install ragas[examples]
 ```
 
-## Test your setup
-
-Verify your setup works:
+Next, ensure you have your API credentials configured:
 
 ```bash
-python -m ragas_examples.benchmark_llm.prompt
+export OPENAI_API_KEY=your_actual_api_key
 ```
 
-This will test a sample customer profile to ensure your setup works.
-
-You should see structured JSON responses showing discount calculations and reasoning.
-
-??? example "Example output"
-    ```bash
-    $ python -m ragas_examples.benchmark_llm.prompt
-    ```
-    Output:
-    ```
-    === System Prompt ===
+## The LLM application
 
-    You are a discount calculation assistant. I will provide a customer profile and you must calculate their discount percentage and explain your reasoning.
+We've set up a simple LLM application for you in the examples package so you can focus on evaluation rather than building the application itself. The application calculates customer discounts based on business rules.
 
-    Discount rules:
-    - Age 65+ OR student status: 15% discount
-    - Annual income < $30,000: 20% discount
-    - Premium member for 2+ years: 10% discount
-    - New customer (< 6 months): 5% discount
-
-    Rules can stack up to a maximum of 35% discount.
-
-    Respond in JSON format only:
-    {
-        "discount_percentage": number,
-        "reason": "clear explanation of which rules apply and calculations",
-        "applied_rules": ["list", "of", "applied", "rule", "names"]
-    }
+Here's the system prompt that defines the discount calculation logic:
 
+```python
+SYSTEM_PROMPT = """
+You are a discount calculation assistant. I will provide a customer profile and you must calculate their discount percentage and explain your reasoning.
+
+Discount rules:
+- Age 65+ OR student status: 15% discount
+- Annual income < $30,000: 20% discount
+- Premium member for 2+ years: 10% discount
+- New customer (< 6 months): 5% discount
+
+Rules can stack up to a maximum of 35% discount.
+
+Respond in JSON format only:
+{
+    "discount_percentage": number,
+    "reason": "clear explanation of which rules apply and calculations",
+    "applied_rules": ["list", "of", "applied", "rule", "names"]
+}
+"""
+```
 
-    === Customer Profile ===
+You can test the application with a sample customer profile:
 
-    Customer Profile:
-    - Name: Sarah Johnson
-    - Age: 67
-    - Student: No
-    - Annual Income: $45,000
-    - Premium Member: Yes, for 3 years
-    - Account Age: 3 years
-
+```python
+from ragas_examples.benchmark_llm.prompt import run_prompt
+
+# Test with a sample customer profile
+customer_profile = """
+Customer Profile:
+- Name: Sarah Johnson
+- Age: 67
+- Student: No
+- Annual Income: $45,000
+- Premium Member: Yes, for 3 years
+- Account Age: 3 years
+"""
+
+result = await run_prompt(customer_profile)
+print(result)
+```
 
-    === Running Prompt with default model gpt-4.1-nano-2025-04-14 ===
+??? "📋 Output"
+    ```json
     {
-    "discount_percentage": 25,
-    "reason": "Sarah qualifies for a 15% discount due to age (67). She also gets a 10% discount for being a premium member for over 2 years. The total stacking of 15% and 10% discounts results in 25%. No other discounts apply based on income or account age.",
-    "applied_rules": ["Age 65+", "Premium member for 2+ years"]
+      "discount_percentage": 25,
+      "reason": "Sarah qualifies for a 15% discount due to age (67). She also gets a 10% discount for being a premium member for over 2 years. The total stacking of 15% and 10% discounts results in 25%. No other discounts apply based on income or account age.",
+      "applied_rules": ["Age 65+", "Premium member for 2+ years"]
     }
    ```
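
Note that the discount rules in the system prompt are fully deterministic, so they can also be written down as a small reference function. The sketch below is illustrative only (the `expected_discount` helper and its argument names are not part of the examples package), but it is handy as a ground-truth oracle when you build or sanity-check the benchmark dataset:

```python
# Illustrative ground-truth helper (not from ragas_examples); it mirrors the
# discount rules stated in the system prompt above.
def expected_discount(age: int, is_student: bool, annual_income: float,
                      premium_years: float, account_age_months: int) -> int:
    discount = 0
    if age >= 65 or is_student:
        discount += 15  # Age 65+ OR student status
    if annual_income < 30_000:
        discount += 20  # Annual income < $30,000
    if premium_years >= 2:
        discount += 10  # Premium member for 2+ years
    if account_age_months < 6:
        discount += 5   # New customer (< 6 months)
    return min(discount, 35)  # Rules can stack up to a maximum of 35%

# Sarah Johnson from the sample profile: 15% (age 67) + 10% (premium, 3 years) = 25%
assert expected_discount(67, False, 45_000, 3, 36) == 25
```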

@@ -117,20 +122,18 @@ It is better to sample real data from your application to create the dataset. If
 
 ```python
 def load_dataset():
-    """Load the dataset from CSV file."""
-    import os
-    # Get the directory where this file is located
+    """Load the dataset from CSV file. Downloads from GitHub if not found locally."""
+    import urllib.request
     current_dir = os.path.dirname(os.path.abspath(__file__))
-
-    dataset = Dataset.load(
-        name="discount_benchmark",
-        backend="local/csv",
-        root_dir=current_dir
-    )
-    return dataset
+    dataset_path = os.path.join(current_dir, "datasets", "discount_benchmark.csv")
+    # Download dataset from GitHub if it doesn't exist locally
+    if not os.path.exists(dataset_path):
+        os.makedirs(os.path.dirname(dataset_path), exist_ok=True)
+        urllib.request.urlretrieve("https://raw.githubusercontent.com/explodinggradients/ragas/main/examples/ragas_examples/benchmark_llm/datasets/discount_benchmark.csv", dataset_path)
+    return Dataset.load(name="discount_benchmark", backend="local/csv", root_dir=current_dir)
 ```
 
-The dataset loader finds your CSV file in the `datasets/` directory and loads it for evaluation.
+The dataset loader checks if the CSV file exists locally. If not found, it automatically downloads it from GitHub.
 
 ### Metrics function
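
The experiment code below reads `score.value` and `score.reason` from the metric defined in the examples package. As a rough illustration only (the actual metric in `ragas_examples.benchmark_llm.evals` may be implemented differently), correctness for this task reduces to an exact match between the predicted and expected discount:

```python
# Plain-Python sketch of a discount-correctness check; names are illustrative
# and not taken from the examples package.
from dataclasses import dataclass

@dataclass
class Score:
    value: str   # "correct" or "incorrect"
    reason: str

def discount_accuracy(predicted: float, expected: float) -> Score:
    """Exact-match check between the model's discount and the expected one."""
    if predicted == expected:
        return Score("correct", f"Predicted {predicted}% matches expected {expected}%")
    return Score("incorrect", f"Predicted {predicted}% but expected {expected}%")

print(discount_accuracy(25, 25).value)  # correct
```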

@@ -164,9 +167,9 @@ Each model evaluation follows this experiment pattern:
 
 ```python
 @experiment()
-async def benchmark_experiment(row):
-    # Get model response (run in thread to keep async runner responsive)
-    response = await asyncio.to_thread(run_prompt, row["customer_profile"], model=model_name)
+async def benchmark_experiment(row, model_name: str):
+    # Get model response
+    response = await run_prompt(row["customer_profile"], model=model_name)
 
     # Parse response (strict JSON mode expected)
     try:

@@ -184,34 +187,53 @@ async def benchmark_experiment(row):
     return {
         **row,
         "model": model_name,
-        "experiment_name": experiment_name,
         "response": response,
         "predicted_discount": predicted_discount,
         "score": score.value,
         "score_reason": score.reason
     }
-
-    return benchmark_experiment
 ```
 
 ## Run experiments
 
-We've setup the main evals code such that you can run it with different models using CLI. We'll use these example models:
+Run evaluation experiments with both baseline and candidate models. We'll compare these example models:
 
 - Baseline: "gpt-4.1-nano-2025-04-14"
 - Candidate: "gpt-5-nano-2025-08-07"
 
-```bash
-# Baseline
-python -m ragas_examples.benchmark_llm.evals run --model "gpt-4.1-nano-2025-04-14"
-
-# Candidate
-python -m ragas_examples.benchmark_llm.evals run --model "gpt-5-nano-2025-08-07"
+```python
+from ragas_examples.benchmark_llm.evals import benchmark_experiment, load_dataset
+
+# Load dataset
+dataset = load_dataset()
+print(f"Dataset loaded with {len(dataset)} samples")
+
+# Run baseline experiment
+baseline_results = await benchmark_experiment.arun(
+    dataset,
+    name="gpt-4.1-nano-2025-04-14",
+    model_name="gpt-4.1-nano-2025-04-14"
+)
+
+# Calculate and display accuracy
+baseline_accuracy = sum(1 for r in baseline_results if r["score"] == "correct") / len(baseline_results)
+print(f"Baseline Accuracy: {baseline_accuracy:.2%}")
+
+# Run candidate experiment
+candidate_results = await benchmark_experiment.arun(
+    dataset,
+    name="gpt-5-nano-2025-08-07",
+    model_name="gpt-5-nano-2025-08-07"
+)
+
+# Calculate and display accuracy
+candidate_accuracy = sum(1 for r in candidate_results if r["score"] == "correct") / len(candidate_results)
+print(f"Candidate Accuracy: {candidate_accuracy:.2%}")
 ```
 
-Each command saves a CSV under `experiments/` with per-row results, including:
+Each experiment saves a CSV under `experiments/` with per-row results, including:
 
-- id, model, experiment_name, response, predicted_discount, score, score_reason
+- id, model, response, predicted_discount, score, score_reason
 
 ??? example "Sample experiment output (only showing few columns for readability)"
     | ID | Description | Expected | Predicted | Score | Score Reason |

@@ -229,34 +251,39 @@ Each command saves a CSV under `experiments/` with per-row results, including:
 
 ## Compare results
 
-After running experiments with different models, you'll want to see how they performed side-by-side. We've setup a simple compare command that takes your experiment results and puts them in one file so you can easily see which model did better on each test case. You can open this in your CSV viewer (Excel/Google Sheets/Numbers) to review.
+After running experiments with different models, compare their performance side-by-side:
 
-```bash
-python -m ragas_examples.benchmark_llm.evals compare \
-  --inputs 'experiments/exp1.csv' 'experiments/exp2.csv'
+```python
+from ragas_examples.benchmark_llm.evals import compare_inputs_to_output
+
+# Compare the two experiment results
+# Update these paths to match your actual experiment output files
+output_path = compare_inputs_to_output(
+    inputs=[
+        "experiments/gpt-4.1-nano-2025-04-14.csv",
+        "experiments/gpt-5-nano-2025-08-07.csv"
+    ]
+)
+
+print(f"Comparison saved to: {output_path}")
 ```
 
-This command will:
+This comparison:
 
-- Read both experiment files
-- Print the accuracy for each model
-- Create a new comparison file with results side-by-side
+- Reads both experiment files
+- Prints accuracy for each model
+- Creates a new CSV with results side-by-side
 
 The comparison file shows:
 
-- The test case details (customer profile, expected discount)
+- Test case details (customer profile, expected discount)
 - For each model: its response, whether it was correct, and why
 
-??? example "Example evaluation output"
-    ```bash
-    $ python -m ragas_examples.benchmark_llm.evals compare \
-      --inputs 'experiments/20250820-145315-gpt-4.1-nano-2025-04-14.csv' 'experiments/20250820-145327-gpt-5-nano-2025-08-07.csv'
-    ```
-
+??? "📋 Output"
     ```
    gpt-4.1-nano-2025-04-14 Accuracy: 50.00%
    gpt-5-nano-2025-08-07 Accuracy: 90.00%
-    Combined comparison saved to: experiments/20250820-150548-comparison.csv
+    Comparison saved to: experiments/20250820-150548-comparison.csv
     ```
 
 ### Analyze results with the combined CSV
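
A convenient way to review the combined CSV is with pandas. The column names below are assumptions based on the per-row fields listed earlier rather than guaranteed headers of the comparison file, so adjust them to whatever your output actually contains:

```python
import pandas as pd

# Path printed by the compare step; the exact timestamped filename differs per run.
df = pd.read_csv("experiments/20250820-150548-comparison.csv")

# Inspect the available columns first, since header names may vary.
print(df.columns.tolist())

# Example: rows where the two models disagree on correctness
# (assumes per-model score columns such as "gpt-4.1-nano-2025-04-14_score").
baseline_col = "gpt-4.1-nano-2025-04-14_score"
candidate_col = "gpt-5-nano-2025-08-07_score"
if {baseline_col, candidate_col} <= set(df.columns):
    disagreements = df[df[baseline_col] != df[candidate_col]]
    print(f"{len(disagreements)} rows where the models disagree")
```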
