
Commit 911a127

Simplify earlier how to guides in docs (#2319)
PR Description:

## Issue Link / Problem Description

- Addresses the complexity barrier in Ragas examples that was making it difficult for beginners to understand and use evaluation workflows
- Removes overly complex async patterns, abstractions, and infrastructure code from examples that obscured the core evaluation concepts

## Changes Made

- **Text2SQL Examples Simplification**: Streamlined all text2sql evaluation components by removing unnecessary async patterns, timing infrastructure, and complex abstractions
- **Database and Data Utils Cleanup**: Simplified `db_utils.py` and `data_utils.py` to focus on core functionality while removing batch processing and concurrency complexity
- **Agent Evaluation Streamlining**: Simplified agent evaluation examples by removing indirection layers and factory patterns
- **Benchmark LLM Simplification**: Converted async patterns to a simpler synchronous approach and removed unnecessary abstractions
- **Improve RAG Examples**: Streamlined evaluation code by removing indirection layers and complex patterns
- **Documentation Updates**: Updated text2sql and benchmark_llm documentation to reflect the simplified examples and remove obsolete parameters
- **Core Library Improvements**: Minor fixes to validation, evaluation, and utility modules for better code quality
- Significantly reduced line count: 2,105 deletions vs 818 additions

## Testing

### How to Test

- [ ] Automated tests added/updated
- [ ] Manual testing steps:
    1. Run the simplified text2sql evaluation examples to ensure functionality is preserved
    2. Verify benchmark_llm examples work with the simplified codebase
    3. Test improve_rag examples to confirm streamlined evaluation flows
    4. Check that documentation accurately reflects the simplified examples
    5. Ensure core ragas evaluation functionality remains intact after utility changes
1 parent d4ff601 commit 911a127

File tree

17 files changed: +934, -2637 lines changed


docs/howtos/applications/benchmark_llm.md

Lines changed: 114 additions & 87 deletions
@@ -21,66 +21,71 @@ We'll use discount calculation as our test case: given a customer profile, calcu
 
 ## Set up your environment and API access
 
-First, ensure you have your API credentials configured:
+First, install the ragas-examples package which contains the benchmark LLM example code:
 
 ```bash
-export OPENAI_API_KEY=your_actual_api_key
+pip install ragas[examples]
 ```
 
-## Test your setup
-
-Verify your setup works:
+Next, ensure you have your API credentials configured:
 
 ```bash
-python -m ragas_examples.benchmark_llm.prompt
+export OPENAI_API_KEY=your_actual_api_key
 ```
 
-This will test a sample customer profile to ensure your setup works.
-
-You should see structured JSON responses showing discount calculations and reasoning.
-
-??? example "Example output"
-    ```bash
-    $ python -m ragas_examples.benchmark_llm.prompt
-    ```
-    Output:
-    ```
-    === System Prompt ===
+## The LLM application
 
-    You are a discount calculation assistant. I will provide a customer profile and you must calculate their discount percentage and explain your reasoning.
+We've set up a simple LLM application for you in the examples package so you can focus on evaluation rather than building the application itself. The application calculates customer discounts based on business rules.
 
-    Discount rules:
-    - Age 65+ OR student status: 15% discount
-    - Annual income < $30,000: 20% discount
-    - Premium member for 2+ years: 10% discount
-    - New customer (< 6 months): 5% discount
-
-    Rules can stack up to a maximum of 35% discount.
-
-    Respond in JSON format only:
-    {
-        "discount_percentage": number,
-        "reason": "clear explanation of which rules apply and calculations",
-        "applied_rules": ["list", "of", "applied", "rule", "names"]
-    }
+Here's the system prompt that defines the discount calculation logic:
 
+```python
+SYSTEM_PROMPT = """
+You are a discount calculation assistant. I will provide a customer profile and you must calculate their discount percentage and explain your reasoning.
+
+Discount rules:
+- Age 65+ OR student status: 15% discount
+- Annual income < $30,000: 20% discount
+- Premium member for 2+ years: 10% discount
+- New customer (< 6 months): 5% discount
+
+Rules can stack up to a maximum of 35% discount.
+
+Respond in JSON format only:
+{
+    "discount_percentage": number,
+    "reason": "clear explanation of which rules apply and calculations",
+    "applied_rules": ["list", "of", "applied", "rule", "names"]
+}
+"""
+```
 
-    === Customer Profile ===
+You can test the application with a sample customer profile:
 
-    Customer Profile:
-    - Name: Sarah Johnson
-    - Age: 67
-    - Student: No
-    - Annual Income: $45,000
-    - Premium Member: Yes, for 3 years
-    - Account Age: 3 years
-
+```python
+from ragas_examples.benchmark_llm.prompt import run_prompt
+
+# Test with a sample customer profile
+customer_profile = """
+Customer Profile:
+- Name: Sarah Johnson
+- Age: 67
+- Student: No
+- Annual Income: $45,000
+- Premium Member: Yes, for 3 years
+- Account Age: 3 years
+"""
+
+result = await run_prompt(customer_profile)
+print(result)
+```
 
-    === Running Prompt with default model gpt-4.1-nano-2025-04-14 ===
+??? "📋 Output"
+    ```json
     {
-    "discount_percentage": 25,
-    "reason": "Sarah qualifies for a 15% discount due to age (67). She also gets a 10% discount for being a premium member for over 2 years. The total stacking of 15% and 10% discounts results in 25%. No other discounts apply based on income or account age.",
-    "applied_rules": ["Age 65+", "Premium member for 2+ years"]
+      "discount_percentage": 25,
+      "reason": "Sarah qualifies for a 15% discount due to age (67). She also gets a 10% discount for being a premium member for over 2 years. The total stacking of 15% and 10% discounts results in 25%. No other discounts apply based on income or account age.",
+      "applied_rules": ["Age 65+", "Premium member for 2+ years"]
     }
    ```
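
Note that the discount rules in the system prompt are fully deterministic, so they can also be written down as a small reference function. The sketch below is illustrative only (the `expected_discount` helper and its argument names are not part of the examples package), but it is handy as a ground-truth oracle when you build or sanity-check the benchmark dataset:

```python
# Illustrative ground-truth helper (not from ragas_examples); it mirrors the
# discount rules stated in the system prompt above.
def expected_discount(age: int, is_student: bool, annual_income: float,
                      premium_years: float, account_age_months: int) -> int:
    discount = 0
    if age >= 65 or is_student:
        discount += 15  # Age 65+ OR student status
    if annual_income < 30_000:
        discount += 20  # Annual income < $30,000
    if premium_years >= 2:
        discount += 10  # Premium member for 2+ years
    if account_age_months < 6:
        discount += 5   # New customer (< 6 months)
    return min(discount, 35)  # Rules can stack up to a maximum of 35%

# Sarah Johnson from the sample profile: 15% (age 67) + 10% (premium, 3 years) = 25%
assert expected_discount(67, False, 45_000, 3, 36) == 25
```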

@@ -117,20 +122,18 @@ It is better to sample real data from your application to create the dataset. If
 
 ```python
 def load_dataset():
-    """Load the dataset from CSV file."""
-    import os
-    # Get the directory where this file is located
+    """Load the dataset from CSV file. Downloads from GitHub if not found locally."""
+    import urllib.request
     current_dir = os.path.dirname(os.path.abspath(__file__))
-
-    dataset = Dataset.load(
-        name="discount_benchmark",
-        backend="local/csv",
-        root_dir=current_dir
-    )
-    return dataset
+    dataset_path = os.path.join(current_dir, "datasets", "discount_benchmark.csv")
+    # Download dataset from GitHub if it doesn't exist locally
+    if not os.path.exists(dataset_path):
+        os.makedirs(os.path.dirname(dataset_path), exist_ok=True)
+        urllib.request.urlretrieve("https://raw.githubusercontent.com/explodinggradients/ragas/main/examples/ragas_examples/benchmark_llm/datasets/discount_benchmark.csv", dataset_path)
+    return Dataset.load(name="discount_benchmark", backend="local/csv", root_dir=current_dir)
 ```
 
-The dataset loader finds your CSV file in the `datasets/` directory and loads it for evaluation.
+The dataset loader checks if the CSV file exists locally. If not found, it automatically downloads it from GitHub.
 
 ### Metrics function
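
The experiment code below reads `score.value` and `score.reason` from the metric defined in the examples package. As a rough illustration only (the actual metric in `ragas_examples.benchmark_llm.evals` may be implemented differently), correctness for this task reduces to an exact match between the predicted and expected discount:

```python
# Plain-Python sketch of a discount-correctness check; names are illustrative
# and not taken from the examples package.
from dataclasses import dataclass

@dataclass
class Score:
    value: str   # "correct" or "incorrect"
    reason: str

def discount_accuracy(predicted: float, expected: float) -> Score:
    """Exact-match check between the model's discount and the expected one."""
    if predicted == expected:
        return Score("correct", f"Predicted {predicted}% matches expected {expected}%")
    return Score("incorrect", f"Predicted {predicted}% but expected {expected}%")

print(discount_accuracy(25, 25).value)  # correct
```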

@@ -164,9 +167,9 @@ Each model evaluation follows this experiment pattern:
 
 ```python
 @experiment()
-async def benchmark_experiment(row):
-    # Get model response (run in thread to keep async runner responsive)
-    response = await asyncio.to_thread(run_prompt, row["customer_profile"], model=model_name)
+async def benchmark_experiment(row, model_name: str):
+    # Get model response
+    response = await run_prompt(row["customer_profile"], model=model_name)
 
     # Parse response (strict JSON mode expected)
     try:

@@ -184,34 +187,53 @@ async def benchmark_experiment(row):
     return {
         **row,
         "model": model_name,
-        "experiment_name": experiment_name,
         "response": response,
         "predicted_discount": predicted_discount,
         "score": score.value,
         "score_reason": score.reason
     }
-
-    return benchmark_experiment
 ```
 
 ## Run experiments
 
-We've setup the main evals code such that you can run it with different models using CLI. We'll use these example models:
+Run evaluation experiments with both baseline and candidate models. We'll compare these example models:
 
 - Baseline: "gpt-4.1-nano-2025-04-14"
 - Candidate: "gpt-5-nano-2025-08-07"
 
-```bash
-# Baseline
-python -m ragas_examples.benchmark_llm.evals run --model "gpt-4.1-nano-2025-04-14"
-
-# Candidate
-python -m ragas_examples.benchmark_llm.evals run --model "gpt-5-nano-2025-08-07"
+```python
+from ragas_examples.benchmark_llm.evals import benchmark_experiment, load_dataset
+
+# Load dataset
+dataset = load_dataset()
+print(f"Dataset loaded with {len(dataset)} samples")
+
+# Run baseline experiment
+baseline_results = await benchmark_experiment.arun(
+    dataset,
+    name="gpt-4.1-nano-2025-04-14",
+    model_name="gpt-4.1-nano-2025-04-14"
+)
+
+# Calculate and display accuracy
+baseline_accuracy = sum(1 for r in baseline_results if r["score"] == "correct") / len(baseline_results)
+print(f"Baseline Accuracy: {baseline_accuracy:.2%}")
+
+# Run candidate experiment
+candidate_results = await benchmark_experiment.arun(
+    dataset,
+    name="gpt-5-nano-2025-08-07",
+    model_name="gpt-5-nano-2025-08-07"
+)
+
+# Calculate and display accuracy
+candidate_accuracy = sum(1 for r in candidate_results if r["score"] == "correct") / len(candidate_results)
+print(f"Candidate Accuracy: {candidate_accuracy:.2%}")
 ```
 
-Each command saves a CSV under `experiments/` with per-row results, including:
+Each experiment saves a CSV under `experiments/` with per-row results, including:
 
-- id, model, experiment_name, response, predicted_discount, score, score_reason
+- id, model, response, predicted_discount, score, score_reason
 
 ??? example "Sample experiment output (only showing few columns for readability)"
     | ID | Description | Expected | Predicted | Score | Score Reason |

@@ -229,34 +251,39 @@ Each command saves a CSV under `experiments/` with per-row results, including:
 
 ## Compare results
 
-After running experiments with different models, you'll want to see how they performed side-by-side. We've setup a simple compare command that takes your experiment results and puts them in one file so you can easily see which model did better on each test case. You can open this in your CSV viewer (Excel/Google Sheets/Numbers) to review.
+After running experiments with different models, compare their performance side-by-side:
 
-```bash
-python -m ragas_examples.benchmark_llm.evals compare \
-  --inputs 'experiments/exp1.csv' 'experiments/exp2.csv'
+```python
+from ragas_examples.benchmark_llm.evals import compare_inputs_to_output
+
+# Compare the two experiment results
+# Update these paths to match your actual experiment output files
+output_path = compare_inputs_to_output(
+    inputs=[
+        "experiments/gpt-4.1-nano-2025-04-14.csv",
+        "experiments/gpt-5-nano-2025-08-07.csv"
+    ]
+)
+
+print(f"Comparison saved to: {output_path}")
 ```
 
-This command will:
+This comparison:
 
-- Read both experiment files
-- Print the accuracy for each model
-- Create a new comparison file with results side-by-side
+- Reads both experiment files
+- Prints accuracy for each model
+- Creates a new CSV with results side-by-side
 
 The comparison file shows:
 
-- The test case details (customer profile, expected discount)
+- Test case details (customer profile, expected discount)
 - For each model: its response, whether it was correct, and why
 
-??? example "Example evaluation output"
-    ```bash
-    $ python -m ragas_examples.benchmark_llm.evals compare \
-      --inputs 'experiments/20250820-145315-gpt-4.1-nano-2025-04-14.csv' 'experiments/20250820-145327-gpt-5-nano-2025-08-07.csv'
-    ```
-
+??? "📋 Output"
     ```
    gpt-4.1-nano-2025-04-14 Accuracy: 50.00%
    gpt-5-nano-2025-08-07 Accuracy: 90.00%
-    Combined comparison saved to: experiments/20250820-150548-comparison.csv
+    Comparison saved to: experiments/20250820-150548-comparison.csv
     ```
 
 ### Analyze results with the combined CSV
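
A convenient way to review the combined CSV is with pandas. The column names below are assumptions based on the per-row fields listed earlier rather than guaranteed headers of the comparison file, so adjust them to whatever your output actually contains:

```python
import pandas as pd

# Path printed by the compare step; the exact timestamped filename differs per run.
df = pd.read_csv("experiments/20250820-150548-comparison.csv")

# Inspect the available columns first, since header names may vary.
print(df.columns.tolist())

# Example: rows where the two models disagree on correctness
# (assumes per-model score columns such as "gpt-4.1-nano-2025-04-14_score").
baseline_col = "gpt-4.1-nano-2025-04-14_score"
candidate_col = "gpt-5-nano-2025-08-07_score"
if {baseline_col, candidate_col} <= set(df.columns):
    disagreements = df[df[baseline_col] != df[candidate_col]]
    print(f"{len(disagreements)} rows where the models disagree")
```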
