---
title: "Running Experiments from Code"
description: "Learn how to run experiments programmatically using the Traceloop SDK"
---

You can run experiments programmatically using the Traceloop SDK. This allows you to systematically evaluate different AI model configurations, prompts, and approaches with your datasets.

## Setup

First, initialize the Traceloop client in your code (the SDK typically picks up your Traceloop API key from the `TRACELOOP_API_KEY` environment variable):

```python
from traceloop.sdk import Traceloop

# Initialize Traceloop
Traceloop.init()
client = Traceloop.client()
```

## Basic Experiment Structure

An experiment consists of:
- A **dataset** to test against
- A **task function** that defines what your AI system should do
- **Evaluators** to measure performance
- An **experiment slug** to identify the experiment

## Task Functions

Create task functions that define how your AI system processes each dataset item:

```python
from openai import AsyncOpenAI

openai_client = AsyncOpenAI()

async def my_task_function(input_data):
    # Your AI processing logic here
    # This could involve calling OpenAI, Anthropic, etc.
    response = await openai_client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": input_data["question"]}
        ]
    )

    return {
        "response": response.choices[0].message.content,
        "model": "gpt-4"
    }
```
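
The task function receives one dataset item at a time; the `question` key above is an assumed column name in the example dataset, so adjust it to match your own schema.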

## Running Experiments

Use the `experiment.run()` method to execute your experiment:

```python
async def run_my_experiment():
    results, errors = await client.experiment.run(
        dataset_slug="my-dataset",
        dataset_version="v1",
        task=my_task_function,
        evaluators=["accuracy", "relevance"],
        experiment_slug="my-experiment-v1"
    )

    print(f"Experiment completed with {len(results)} results and {len(errors)} errors")
    return results, errors
```

## Comparing Different Approaches

You can run multiple experiments against the same dataset to compare different approaches:

```python
# Task function with conservative prompting
async def conservative_task(input_data):
    response = await openai_client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Be very careful and conservative in your response."},
            {"role": "user", "content": input_data["question"]}
        ]
    )
    return {"response": response.choices[0].message.content}

# Task function with creative prompting
async def creative_task(input_data):
    response = await openai_client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Be creative and think outside the box."},
            {"role": "user", "content": input_data["question"]}
        ]
    )
    return {"response": response.choices[0].message.content}

# Run both experiments
async def compare_approaches():
    # Conservative approach
    conservative_results, _ = await client.experiment.run(
        dataset_slug="my-dataset",
        dataset_version="v1",
        task=conservative_task,
        evaluators=["accuracy"],
        experiment_slug="conservative-approach"
    )

    # Creative approach
    creative_results, _ = await client.experiment.run(
        dataset_slug="my-dataset",
        dataset_version="v1",
        task=creative_task,
        evaluators=["accuracy"],
        experiment_slug="creative-approach"
    )

    return conservative_results, creative_results
```
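
Because both runs share the same dataset version and evaluators, their scores are directly comparable; only the system prompt differs between the two experiments.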

## Complete Example

Here's a full example that tests different email generation strategies for customer support:

```python
import asyncio

from openai import AsyncOpenAI
from traceloop.sdk import Traceloop

# Initialize Traceloop and the OpenAI client
Traceloop.init()
client = Traceloop.client()
openai_client = AsyncOpenAI()

async def generate_support_email(customer_issue, tone="professional"):
    tone_prompts = {
        "professional": "You are a professional customer support agent. Write clear, formal responses that solve the customer's issue.",
        "friendly": "You are a friendly customer support agent. Write warm, conversational responses that make the customer feel valued.",
        "concise": "You are an efficient customer support agent. Write brief, direct responses that quickly address the customer's issue."
    }

    response = await openai_client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": tone_prompts[tone]},
            {"role": "user", "content": f"Customer issue: {customer_issue}"}
        ]
    )

    return response.choices[0].message.content

# Task function for professional tone
async def professional_support_task(input_data):
    email = await generate_support_email(input_data["issue"], tone="professional")
    return {
        "email_response": email,
        "tone": "professional"
    }

# Task function for friendly tone
async def friendly_support_task(input_data):
    email = await generate_support_email(input_data["issue"], tone="friendly")
    return {
        "email_response": email,
        "tone": "friendly"
    }

# Task function for concise tone
async def concise_support_task(input_data):
    email = await generate_support_email(input_data["issue"], tone="concise")
    return {
        "email_response": email,
        "tone": "concise"
    }

async def run_support_experiment():
    # Settings shared by all three experiment runs
    shared_config = {
        "dataset_slug": "customer-support-issues",
        "dataset_version": "v2",
        "evaluators": ["helpfulness", "clarity", "customer_satisfaction"]
    }

    # Test professional tone
    professional_results, prof_errors = await client.experiment.run(
        **shared_config,
        task=professional_support_task,
        experiment_slug="support-professional-tone"
    )

    # Test friendly tone
    friendly_results, friendly_errors = await client.experiment.run(
        **shared_config,
        task=friendly_support_task,
        experiment_slug="support-friendly-tone"
    )

    # Test concise tone
    concise_results, concise_errors = await client.experiment.run(
        **shared_config,
        task=concise_support_task,
        experiment_slug="support-concise-tone"
    )

    print(f"Professional tone: {len(professional_results)} results, {len(prof_errors)} errors")
    print(f"Friendly tone: {len(friendly_results)} results, {len(friendly_errors)} errors")
    print(f"Concise tone: {len(concise_results)} results, {len(concise_errors)} errors")

    return professional_results, friendly_results, concise_results

if __name__ == "__main__":
    asyncio.run(run_support_experiment())
```

## Parameters

### `experiment.run()` Parameters

- `dataset_slug` (str): Identifier for your dataset
- `dataset_version` (str): Version of the dataset to use
- `task` (function): Async function that processes each dataset item
- `evaluators` (list): List of evaluator names to measure performance
- `experiment_slug` (str): Unique identifier for this experiment

### Task Function Requirements

Your task function should:
- Be async (`async def`)
- Accept one parameter (the input data from your dataset)
- Return a dictionary with your results
- Handle errors gracefully, as in the sketch below
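
For example, a task function that catches failures and reports them in its return value might look like this (a sketch; the `question` field and the `error` key are illustrative choices, not a required format):

```python
from openai import AsyncOpenAI

openai_client = AsyncOpenAI()

async def safe_task(input_data):
    try:
        response = await openai_client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": input_data["question"]}]
        )
        return {"response": response.choices[0].message.content}
    except Exception as e:
        # Return an error marker instead of raising so one bad
        # dataset item doesn't abort the rest of the run.
        return {"response": None, "error": str(e)}
```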

## Best Practices

1. **Use descriptive experiment slugs** to easily identify different runs (see the naming sketch below)
2. **Version your datasets** to ensure reproducible results
3. **Handle errors** in your task functions to avoid experiment failures
4. **Use appropriate evaluators** that match your use case
5. **Compare multiple approaches** systematically to find the best solution
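
As an example of the first point, one possible (purely illustrative) slug convention encodes the strategy, model, and date so runs are easy to tell apart later:

```python
from datetime import date

# Hypothetical naming convention: <use case>-<strategy>-<model>-<date>
experiment_slug = f"support-friendly-gpt-4-{date.today().isoformat()}"
```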