PR Description:
## Issue Link / Problem Description
<!-- Link to related issue or describe the problem this PR solves -->
- Addresses the complexity barrier in Ragas examples that was making it
difficult for beginners to understand and use evaluation workflows
- Removes overly complex async patterns, abstractions, and
infrastructure code from examples that obscured the core evaluation
concepts
## Changes Made
<!-- Describe what you changed and why -->
- **Text2SQL Examples Simplification**: Streamlined all text2sql
evaluation components by removing unnecessary async patterns, timing
infrastructure, and complex abstractions
- **Database and Data Utils Cleanup**: Simplified `db_utils.py` and
`data_utils.py` to focus on core functionality while removing batch
processing and concurrency complexity
- **Agent Evaluation Streamlining**: Simplified agent evaluation
examples by removing indirection layers and factory patterns
- **Benchmark LLM Simplification**: Converted async patterns to a simpler
synchronous approach and removed unnecessary abstractions
- **RAG Examples Improvements**: Streamlined RAG evaluation code by removing
indirection layers and complex patterns
- **Documentation Updates**: Updated text2sql and benchmark_llm
documentation to reflect simplified examples and remove obsolete
parameters
- **Core Library Improvements**: Minor fixes to validation, evaluation,
and utility modules for better code quality
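
Several of the bullets above describe replacing async patterns with a plain synchronous style. As a hedged illustration of what that kind of conversion typically looks like (the `run_all` and `run_prompt` names here are hypothetical, not code from this PR):

```python
# Hypothetical sketch of the async-to-sync simplification described above;
# neither function is actual Ragas code.

# Before: concurrency machinery that obscures the core evaluation loop
# async def run_all(prompts):
#     return await asyncio.gather(*(run_prompt(p) for p in prompts))

# After: a plain sequential loop that beginners can read and step through
def run_all(prompts, run_prompt):
    """Run each prompt in order and collect the results."""
    return [run_prompt(p) for p in prompts]

# Stand-in for the LLM call, just to show the shape:
print(run_all(["hello", "world"], str.upper))  # ['HELLO', 'WORLD']
```

The sequential version trades throughput for readability, which is the stated goal of the examples cleanup.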
## Testing
<!-- Describe how this should be tested -->
### How to Test
- [ ] Automated tests added/updated
- [ ] Manual testing steps:
1. Run the simplified text2sql evaluation examples to ensure
functionality is preserved
2. Verify benchmark_llm examples work with simplified codebase
3. Test improve_rag examples to confirm streamlined evaluation flows
4. Check that documentation accurately reflects the simplified examples
5. Ensure core ragas evaluation functionality remains intact after
utility changes
- Significantly reduced line count: 2,105 deletions vs 818 additions
````diff
@@ -21,66 +21,71 @@ We'll use discount calculation as our test case: given a customer profile, calcu
 
 ## Set up your environment and API access
 
-First, ensure you have your API credentials configured:
+First, install the ragas-examples package which contains the benchmark LLM example code:
 
 ```bash
-export OPENAI_API_KEY=your_actual_api_key
+pip install ragas[examples]
 ```
 
-## Test your setup
-
-Verify your setup works:
+Next, ensure you have your API credentials configured:
 
 ```bash
-python -m ragas_examples.benchmark_llm.prompt
+export OPENAI_API_KEY=your_actual_api_key
 ```
 
-This will test a sample customer profile to ensure your setup works.
-
-You should see structured JSON responses showing discount calculations and reasoning.
-
-??? example "Example output"
-    ```bash
-    $ python -m ragas_examples.benchmark_llm.prompt
-    ```
-    Output:
-    ```
-    === System Prompt ===
+## The LLM application
 
-    You are a discount calculation assistant. I will provide a customer profile and you must calculate their discount percentage and explain your reasoning.
+We've set up a simple LLM application for you in the examples package so you can focus on evaluation rather than building the application itself. The application calculates customer discounts based on business rules.
 
-    Discount rules:
-    - Age 65+ OR student status: 15% discount
-    - Annual income < $30,000: 20% discount
-    - Premium member for 2+ years: 10% discount
-    - New customer (< 6 months): 5% discount
-
-    Rules can stack up to a maximum of 35% discount.
-
-    Respond in JSON format only:
-    {
-    "discount_percentage": number,
-    "reason": "clear explanation of which rules apply and calculations",
+Here's the system prompt that defines the discount calculation logic:
 
+```python
+SYSTEM_PROMPT = """
+You are a discount calculation assistant. I will provide a customer profile and you must calculate their discount percentage and explain your reasoning.
+
+Discount rules:
+- Age 65+ OR student status: 15% discount
+- Annual income < $30,000: 20% discount
+- Premium member for 2+ years: 10% discount
+- New customer (< 6 months): 5% discount
+
+Rules can stack up to a maximum of 35% discount.
+
+Respond in JSON format only:
+{
+"discount_percentage": number,
+"reason": "clear explanation of which rules apply and calculations",
+You can test the application with a sample customer profile:
 
-    Customer Profile:
-    - Name: Sarah Johnson
-    - Age: 67
-    - Student: No
-    - Annual Income: $45,000
-    - Premium Member: Yes, for 3 years
-    - Account Age: 3 years
-
+```python
+from ragas_examples.benchmark_llm.prompt import run_prompt
+
+# Test with a sample customer profile
+customer_profile = """
+Customer Profile:
+- Name: Sarah Johnson
+- Age: 67
+- Student: No
+- Annual Income: $45,000
+- Premium Member: Yes, for 3 years
+- Account Age: 3 years
+"""
+
+result = await run_prompt(customer_profile)
+print(result)
+```
 
-    === Running Prompt with default model gpt-4.1-nano-2025-04-14 ===
+??? "📋 Output"
+    ```json
     {
-    "discount_percentage": 25,
-    "reason": "Sarah qualifies for a 15% discount due to age (67). She also gets a 10% discount for being a premium member for over 2 years. The total stacking of 15% and 10% discounts results in 25%. No other discounts apply based on income or account age.",
-    "applied_rules": ["Age 65+", "Premium member for 2+ years"]
+    "discount_percentage": 25,
+    "reason": "Sarah qualifies for a 15% discount due to age (67). She also gets a 10% discount for being a premium member for over 2 years. The total stacking of 15% and 10% discounts results in 25%. No other discounts apply based on income or account age.",
+    "applied_rules": ["Age 65+", "Premium member for 2+ years"]
     }
     ```
 
````
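A nice property of this benchmark, visible in the system prompt in the diff above, is that the discount rules are deterministic, so expected answers can be computed without an LLM. A minimal sketch of such a ground-truth checker (illustrative only, not part of this PR):

```python
# Plain-Python version of the prompt's discount rules (illustrative, not PR code).
# Handy for verifying an LLM's JSON answers against exact expected values.
def expected_discount(age, is_student, income, premium_years, account_months):
    """Apply the stacking discount rules, capped at 35%."""
    total = 0
    if age >= 65 or is_student:
        total += 15  # Age 65+ OR student status
    if income < 30_000:
        total += 20  # Annual income < $30,000
    if premium_years >= 2:
        total += 10  # Premium member for 2+ years
    if account_months < 6:
        total += 5   # New customer (< 6 months)
    return min(total, 35)  # rules stack up to a maximum of 35%

# Sarah Johnson: age 67, not a student, $45,000 income,
# premium member for 3 years, 3-year-old account
print(expected_discount(67, False, 45_000, 3, 36))  # 25
```

This matches the 25% in the example output above: 15% for age plus 10% for premium membership.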
````diff
@@ -117,20 +122,18 @@ It is better to sample real data from your application to create the dataset. If
 
 ```python
 def load_dataset():
-    """Load the dataset from CSV file."""
-    import os
-    # Get the directory where this file is located
+    """Load the dataset from CSV file. Downloads from GitHub if not found locally."""
````
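The new docstring above mentions downloading the dataset from GitHub when no local copy exists. A minimal sketch of that read-local-or-download pattern follows; the default path is a placeholder, and this makes no claim about the actual `ragas_examples` implementation:

```python
import csv
import os
import urllib.request

def load_dataset(path="datasets/benchmark.csv", url=None):
    """Read rows from a local CSV, downloading it first if it is missing."""
    if not os.path.exists(path) and url is not None:
        # fetch once; later runs reuse the local copy
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
        urllib.request.urlretrieve(url, path)
    with open(path, newline="") as f:
        return list(csv.DictReader(f))
```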
````diff
@@ -229,34 +251,39 @@ Each command saves a CSV under `experiments/` with per-row results, including:
 
 ## Compare results
 
-After running experiments with different models, you'll want to see how they performed side-by-side. We've setup a simple compare command that takes your experiment results and puts them in one file so you can easily see which model did better on each test case. You can open this in your CSV viewer (Excel/Google Sheets/Numbers) to review.
+After running experiments with different models, compare their performance side-by-side:
````
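The compare step described above boils down to joining the per-model experiment CSVs on a shared key so each test case shows every model's result on one row. A stdlib-only sketch of that merge (the `id` key and column names are assumptions, not the actual `ragas_examples` CSV schema):

```python
import csv

def compare_results(paths, key="id"):
    """Join experiment CSVs on a shared key so each row shows every
    model's result side by side, ready for a spreadsheet viewer."""
    merged = {}
    for label, path in paths.items():
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                out = merged.setdefault(row[key], {key: row[key]})
                for col, val in row.items():
                    if col != key:
                        # prefix each model's columns with its label
                        out[f"{label}_{col}"] = val
    return list(merged.values())
```

The merged rows can then be written back out with `csv.DictWriter` and opened in Excel/Google Sheets/Numbers for review.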