
Commit 2a0c952

nirga and Claude authored
docs: add experiments section with programmatic execution guide (#101)
Co-authored-by: Claude <noreply@anthropic.com>
1 parent edd6ef7 commit 2a0c952

2 files changed: +239 -0 lines changed

experiments/running-from-code.mdx

Lines changed: 235 additions & 0 deletions
---
title: "Running Experiments from Code"
description: "Learn how to run experiments programmatically using the Traceloop SDK"
---

You can run experiments programmatically using the Traceloop SDK. This allows you to systematically evaluate different AI model configurations, prompts, and approaches with your datasets.

## Setup

First, initialize the Traceloop client in your code:

```python
from traceloop.sdk import Traceloop

# Initialize Traceloop
Traceloop.init()
client = Traceloop.client()
```
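If credentials are not configured yet, a small guard can fail fast before any experiment runs. This sketch assumes the usual environment variables (`TRACELOOP_API_KEY` for the Traceloop SDK and `OPENAI_API_KEY` for the OpenAI calls in the later examples); adapt it if your deployment supplies credentials differently:

```python
import os

# Fail fast if the credentials the examples below rely on are missing.
# Assumes keys are provided via environment variables.
for var in ("TRACELOOP_API_KEY", "OPENAI_API_KEY"):
    if not os.getenv(var):
        raise RuntimeError(f"Missing environment variable: {var}")
```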
## Basic Experiment Structure

An experiment consists of:

- A **dataset** to test against
- A **task function** that defines what your AI system should do
- **Evaluators** to measure performance
- An **experiment slug** to identify the experiment (see the minimal sketch below)
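Putting these pieces together, each item in the list above maps onto one argument of `experiment.run()`. The task, dataset slug, evaluator name, and experiment slug below are placeholders for illustration:

```python
async def uppercase_task(input_data):
    # Stand-in task: a real task would call your model here.
    return {"response": input_data["question"].upper()}

async def run_minimal_experiment():
    results, errors = await client.experiment.run(
        dataset_slug="my-dataset",             # the dataset to test against
        dataset_version="v1",                  # pin the dataset version
        task=uppercase_task,                   # how each dataset item is processed
        evaluators=["accuracy"],               # how outputs are scored
        experiment_slug="uppercase-baseline",  # identifies this run
    )
    return results, errors
```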
## Task Functions

Create task functions that define how your AI system processes each dataset item:

```python
from openai import AsyncOpenAI

openai_client = AsyncOpenAI()

async def my_task_function(input_data):
    # Your AI processing logic here.
    # This could involve calling OpenAI, Anthropic, etc.
    response = await openai_client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": input_data["question"]}
        ]
    )

    return {
        "response": response.choices[0].message.content,
        "model": "gpt-4"
    }
```

## Running Experiments

Use the `experiment.run()` method to execute your experiment:

```python
async def run_my_experiment():
    results, errors = await client.experiment.run(
        dataset_slug="my-dataset",
        dataset_version="v1",
        task=my_task_function,
        evaluators=["accuracy", "relevance"],
        experiment_slug="my-experiment-v1"
    )

    print(f"Experiment completed with {len(results)} results and {len(errors)} errors")
    return results, errors
```
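Both returned values are collections, so a simple guard after the run can surface failures. The exact structure of each error entry may vary between SDK versions, so this sketch only prints and counts them:

```python
import asyncio

async def main():
    results, errors = await run_my_experiment()

    # Surface failed items; their exact shape depends on the SDK version,
    # so they are only printed and counted here.
    if errors:
        for error in errors:
            print(f"Experiment item failed: {error}")
        raise RuntimeError(f"{len(errors)} experiment item(s) failed")

if __name__ == "__main__":
    asyncio.run(main())
```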
## Comparing Different Approaches

You can run multiple experiments to compare different approaches:

```python
# Reuses the `client` (Traceloop) and `openai_client` (OpenAI) instances
# created in the earlier snippets.

# Task function with conservative prompting
async def conservative_task(input_data):
    response = await openai_client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Be very careful and conservative in your response."},
            {"role": "user", "content": input_data["question"]}
        ]
    )
    return {"response": response.choices[0].message.content}

# Task function with creative prompting
async def creative_task(input_data):
    response = await openai_client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Be creative and think outside the box."},
            {"role": "user", "content": input_data["question"]}
        ]
    )
    return {"response": response.choices[0].message.content}

# Run both experiments
async def compare_approaches():
    # Conservative approach
    conservative_results, _ = await client.experiment.run(
        dataset_slug="my-dataset",
        dataset_version="v1",
        task=conservative_task,
        evaluators=["accuracy"],
        experiment_slug="conservative-approach"
    )

    # Creative approach
    creative_results, _ = await client.experiment.run(
        dataset_slug="my-dataset",
        dataset_version="v1",
        task=creative_task,
        evaluators=["accuracy"],
        experiment_slug="creative-approach"
    )

    return conservative_results, creative_results
```
## Complete Example

Here's a full example that tests different email generation strategies for customer support:

```python
import asyncio

from openai import AsyncOpenAI
from traceloop.sdk import Traceloop

# Initialize Traceloop
Traceloop.init()
client = Traceloop.client()

# Initialize the async OpenAI client
openai_client = AsyncOpenAI()

async def generate_support_email(customer_issue, tone="professional"):
    tone_prompts = {
        "professional": "You are a professional customer support agent. Write clear, formal responses that solve the customer's issue.",
        "friendly": "You are a friendly customer support agent. Write warm, conversational responses that make the customer feel valued.",
        "concise": "You are an efficient customer support agent. Write brief, direct responses that quickly address the customer's issue."
    }

    response = await openai_client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": tone_prompts[tone]},
            {"role": "user", "content": f"Customer issue: {customer_issue}"}
        ]
    )

    return response.choices[0].message.content

# Task function for professional tone
async def professional_support_task(input_data):
    email = await generate_support_email(input_data["issue"], tone="professional")
    return {
        "email_response": email,
        "tone": "professional"
    }

# Task function for friendly tone
async def friendly_support_task(input_data):
    email = await generate_support_email(input_data["issue"], tone="friendly")
    return {
        "email_response": email,
        "tone": "friendly"
    }

# Task function for concise tone
async def concise_support_task(input_data):
    email = await generate_support_email(input_data["issue"], tone="concise")
    return {
        "email_response": email,
        "tone": "concise"
    }

async def run_support_experiment():
    dataset_config = {
        "dataset_slug": "customer-support-issues",
        "dataset_version": "v2",
        "evaluators": ["helpfulness", "clarity", "customer_satisfaction"]
    }

    # Test professional tone
    professional_results, prof_errors = await client.experiment.run(
        **dataset_config,
        task=professional_support_task,
        experiment_slug="support-professional-tone"
    )

    # Test friendly tone
    friendly_results, friendly_errors = await client.experiment.run(
        **dataset_config,
        task=friendly_support_task,
        experiment_slug="support-friendly-tone"
    )

    # Test concise tone
    concise_results, concise_errors = await client.experiment.run(
        **dataset_config,
        task=concise_support_task,
        experiment_slug="support-concise-tone"
    )

    print(f"Professional tone: {len(professional_results)} results, {len(prof_errors)} errors")
    print(f"Friendly tone: {len(friendly_results)} results, {len(friendly_errors)} errors")
    print(f"Concise tone: {len(concise_results)} results, {len(concise_errors)} errors")

    return professional_results, friendly_results, concise_results

if __name__ == "__main__":
    asyncio.run(run_support_experiment())
```
## Parameters

### `experiment.run()` Parameters

- `dataset_slug` (str): Identifier for your dataset
- `dataset_version` (str): Version of the dataset to use
- `task` (function): Async function that processes each dataset item
- `evaluators` (list): List of evaluator names to measure performance
- `experiment_slug` (str): Unique identifier for this experiment

### Task Function Requirements

Your task function should:

- Be async (`async def`)
- Accept one parameter (the input data from your dataset)
- Return a dictionary with your results
- Handle errors gracefully (see the sketch after this list)
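As a sketch of the last two requirements, a task function can catch its own failures and report them in the returned dictionary instead of raising. It reuses the `openai_client` from the earlier snippets, and the `error` key is an illustrative convention rather than something the SDK prescribes:

```python
async def robust_task(input_data):
    try:
        response = await openai_client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": input_data["question"]}],
        )
        return {"response": response.choices[0].message.content}
    except Exception as exc:
        # Report the failure in the result instead of crashing the whole run.
        return {"response": None, "error": str(exc)}
```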
## Best Practices

1. **Use descriptive experiment slugs** to easily identify different runs
2. **Version your datasets** to ensure reproducible results
3. **Handle errors** in your task functions to avoid experiment failures
4. **Use appropriate evaluators** that match your use case
5. **Compare multiple approaches** systematically to find the best solution (see the sketch below)
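To illustrate practices 1 and 5, the sketch below runs every tone variant from the complete example against the same dataset with self-describing slugs. The `TONE_TASKS` mapping and the slug naming scheme are illustrative conventions, not SDK requirements:

```python
# Map a human-readable variant name to its task function (defined in the complete example).
TONE_TASKS = {
    "professional": professional_support_task,
    "friendly": friendly_support_task,
    "concise": concise_support_task,
}

async def run_all_tone_variants():
    outcomes = {}
    for name, task in TONE_TASKS.items():
        results, errors = await client.experiment.run(
            dataset_slug="customer-support-issues",
            dataset_version="v2",
            task=task,
            evaluators=["helpfulness", "clarity", "customer_satisfaction"],
            # Descriptive slug: what was tested, with which model, on which dataset version.
            experiment_slug=f"support-{name}-gpt-4-v2",
        )
        outcomes[name] = (results, errors)
    return outcomes
```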

mint.json

Lines changed: 4 additions & 0 deletions
@@ -143,6 +143,10 @@
         "group": "Quick Start",
         "pages": ["hub/getting-started", "hub/configuration"]
       },
+      {
+        "group": "Experiments",
+        "pages": ["experiments/running-from-code"]
+      },
       {
         "group": "Monitoring",
         "pages": ["monitoring/introduction"]
