Commit e190add

Refactor datadog debugging workflow to follow basic_action pattern (#272)
Co-authored-by: openhands <openhands@all-hands.dev>
1 parent be9725b commit e190add

4 files changed

+1293
-0
lines changed

Lines changed: 299 additions & 0 deletions
@@ -0,0 +1,299 @@
# Datadog Error Debugging Workflow

This example demonstrates how to use OpenHands agents to automatically debug errors from Datadog in a GitHub Actions workflow.

## Overview

The workflow:
1. Fetches errors from Datadog based on configurable queries
2. Searches for or creates GitHub issues to track errors
3. Clones relevant repositories for comprehensive analysis
4. Uses OpenHands AI agents to analyze code and identify root causes
5. Posts debugging insights as comments on GitHub issues

## Files

- `workflow.yml` - GitHub Actions workflow with manual trigger
- `datadog_debugging.py` - Main debugging script
- `debug_prompt.jinja` - Template for AI debugging prompts

## Features

### Manual Trigger
Run on demand via the GitHub Actions UI with configurable inputs:
- **Query Type**: Choose between `log-query` (search) or `log-error-id` (specific error ID)
- **Datadog Query**:
  - For `log-query`: Search query like `service:deploy ClientDisconnect`
  - For `log-error-id`: Specific error tracking ID like `2adba034-ab5a-11f0-b04e-da7ad0900000`
- Repository list to analyze
- Issue repository for tracking
- Parent issue for organization
- LLM model selection

### Smart Issue Management
- Searches for existing issues before creating new ones, to avoid duplicates
- URL-encodes search queries sent to the GitHub API
- Selects the oldest matching issue when duplicates exist
- Links to parent tracking issue

### Multi-Repository Analysis
- Clone multiple repositories for comprehensive context
- Agent has full view of all relevant codebases
- Identifies root causes across repository boundaries

### AI-Powered Debugging
- Automatic code analysis using OpenHands agents
- Identifies error locations and root causes
- Provides actionable fix recommendations
- Posts detailed findings as GitHub comments

## Setup

### Required Secrets

Configure these in your repository Settings → Secrets and variables → Actions:

```yaml
DD_API_KEY: Your Datadog API key
DD_APP_KEY: Your Datadog Application key
DD_SITE: Your Datadog site (e.g., us5.datadoghq.com)
LLM_API_KEY: API key for LLM service
LLM_BASE_URL: Base URL for LLM service (optional)
```

**Note**: `GITHUB_TOKEN` is automatically provided by GitHub Actions.
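
A quick pre-flight check can make a missing secret fail fast with a readable message instead of a later API error. The sketch below is an illustration under stated assumptions: `REQUIRED_SECRETS` and `missing_secrets` are hypothetical names, and the example script may validate its environment differently. `LLM_BASE_URL` is left out because it is optional, and `GITHUB_TOKEN` because Actions provides it.

```python
from collections.abc import Mapping

# Secrets listed as required above; the tuple name is an assumption for this sketch.
REQUIRED_SECRETS = ("DD_API_KEY", "DD_APP_KEY", "DD_SITE", "LLM_API_KEY")


def missing_secrets(env: Mapping[str, str]) -> list[str]:
    """Return the names of required secrets that are unset or blank in `env`."""
    return [name for name in REQUIRED_SECRETS if not env.get(name, "").strip()]
```

In a script, this could be called as `missing_secrets(os.environ)` at startup, exiting with the list of missing names if it is non-empty.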

### Installation

1. Copy `workflow.yml` to your repository's `.github/workflows/` directory (e.g., `.github/workflows/datadog-debugging.yml`)
2. Configure the required secrets in repository Settings → Secrets and variables → Actions
3. Optionally, customize the workflow inputs and defaults in the YAML file

**Note**: The workflow automatically downloads the latest version of `datadog_debugging.py` and `debug_prompt.jinja` from the SDK repository at runtime. No need to copy these files to your repository unless you want to customize them.

## Usage

### Via GitHub Actions UI

1. Go to the **Actions** tab in your repository
2. Select **Datadog Error Debugging** workflow
3. Click **Run workflow**
4. Configure inputs:
   - **Query Type**: Choose `log-query` or `log-error-id` (default: `log-query`)
   - **Datadog Query**:
     - For `log-query`: Search query (default: `service:deploy ClientDisconnect`)
     - For `log-error-id`: Error tracking ID (e.g., `2adba034-ab5a-11f0-b04e-da7ad0900000`)
   - **Repository List**: Comma-separated repos to analyze (default: `OpenHands/OpenHands,All-Hands-AI/infra`)
   - **Issue Repository**: Where to create issues (default: `All-Hands-AI/infra`)
   - **Parent Issue**: Optional parent issue URL for tracking
   - **Issue Prefix**: Prefix for issue titles (default: `DataDog Error: `)
   - **LLM Model**: Model to use (default: `openhands/claude-sonnet-4-5-20250929`)
5. Click **Run workflow**

### Via GitHub CLI

**Search for errors matching a query:**
```bash
gh workflow run datadog-debugging.yml \
  -f query_type="log-query" \
  -f datadog_query="service:deploy ClientDisconnect" \
  -f repo_list="OpenHands/OpenHands,All-Hands-AI/infra" \
  -f issue_repo="All-Hands-AI/infra"
```

**Debug a specific error by ID:**
```bash
gh workflow run datadog-debugging.yml \
  -f query_type="log-error-id" \
  -f datadog_query="2adba034-ab5a-11f0-b04e-da7ad0900000" \
  -f repo_list="OpenHands/OpenHands,All-Hands-AI/infra,All-Hands-AI/deploy" \
  -f issue_repo="All-Hands-AI/infra"
```

## Example

### Input (Search Query)
```yaml
query_type: "log-query"
datadog_query: "service:deploy ClientDisconnect"
repo_list: "OpenHands/OpenHands,All-Hands-AI/infra,All-Hands-AI/deploy"
issue_repo: "All-Hands-AI/infra"
issue_parent: "https://github.com/All-Hands-AI/infra/issues/672"
```

### Input (Specific Error ID)
```yaml
query_type: "log-error-id"
datadog_query: "2adba034-ab5a-11f0-b04e-da7ad0900000"
repo_list: "OpenHands/OpenHands,All-Hands-AI/infra,All-Hands-AI/deploy"
issue_repo: "All-Hands-AI/infra"
issue_parent: "https://github.com/All-Hands-AI/infra/issues/672"
```

### Output
- **Console**: Progress logs showing error fetching, repository cloning, and agent analysis
- **GitHub Issue**: Created or updated with error details
- **GitHub Comment**: AI-generated analysis with root cause and recommendations
- **Artifacts**: Debugging data and logs saved for 7 days

### Real Example

See a real run with production data:
- Error: `starlette.requests.ClientDisconnect` (1,526 occurrences)
- Issue: https://github.com/All-Hands-AI/infra/issues/703
- AI Analysis: https://github.com/All-Hands-AI/infra/issues/703#issuecomment-3480707049

The agent identified:
- Error locations in `github.py` and `gitlab.py`
- Root cause: Unhandled `ClientDisconnect` exceptions
- Recommendations: Add proper error handling for client disconnections

## Configuration

### Datadog Query Examples

```yaml
# ClientDisconnect errors
service:deploy ClientDisconnect

# Server errors (5xx)
service:deploy http.status_code:5*

# Database errors
service:deploy (database OR postgresql) status:error

# Authentication errors
service:deploy (authentication OR authorization) status:error

# Rate limit errors
service:deploy rate_limit status:error
```
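
To show where such a query string ends up, here is a rough sketch of a request to Datadog's v2 Logs search endpoint (`POST /api/v2/logs/events/search`). It only builds the request and does not send it. The helper name `build_log_search_request`, the one-day time window, and the mapping of `max_errors` to the page limit are assumptions for illustration; the example script may call the API differently, and the `log-error-id` path is not covered here.

```python
import json
import urllib.request


def build_log_search_request(
    site: str, api_key: str, app_key: str, query: str, max_errors: int = 5
) -> urllib.request.Request:
    """Build (but do not send) a Datadog Logs search request (hypothetical helper)."""
    body = {
        # Assumed window: the last 24 hours, using Datadog's relative time syntax.
        "filter": {"query": query, "from": "now-1d", "to": "now"},
        "sort": "-timestamp",  # newest events first
        "page": {"limit": max_errors},
    }
    return urllib.request.Request(
        url=f"https://api.{site}/api/v2/logs/events/search",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "DD-API-KEY": api_key,
            "DD-APPLICATION-KEY": app_key,
        },
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would perform the search; `site` corresponds to the `DD_SITE` secret (for example `us5.datadoghq.com`).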

### Repository List Format

Comma-separated list of `owner/repo`:
```
OpenHands/OpenHands,All-Hands-AI/infra,All-Hands-AI/deploy
```
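
Parsing and validating this format might look like the sketch below. The function name `parse_repo_list` and the character set accepted by the pattern are assumptions made for illustration; the example script may be more or less strict.

```python
import re

# Assumed shape of a valid entry: owner/repo using letters, digits, '_', '.', and '-'.
_REPO_PATTERN = re.compile(r"^[A-Za-z0-9_.-]+/[A-Za-z0-9_.-]+$")


def parse_repo_list(repo_list: str) -> list[str]:
    """Split a comma-separated repo list and reject entries not in owner/repo form."""
    # Tolerate stray whitespace and trailing commas.
    repos = [item.strip() for item in repo_list.split(",") if item.strip()]
    invalid = [repo for repo in repos if not _REPO_PATTERN.match(repo)]
    if invalid:
        raise ValueError(f"Expected owner/repo format, got: {', '.join(invalid)}")
    return repos
```

Validating up front turns a malformed entry into a clear error before any clone is attempted.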

### LLM Model Options

- `openhands/claude-sonnet-4-5-20250929` - Best quality (default)
- `openhands/claude-haiku-4-5-20251001` - Faster, cheaper
- `anthropic/claude-3-5-sonnet-20241022` - Alternative

## Workflow Details

### Inputs

| Input | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `datadog_query` | string | Yes | `service:deploy ClientDisconnect` | Datadog query to search for errors |
| `repo_list` | string | Yes | `OpenHands/OpenHands,All-Hands-AI/infra` | Comma-separated list of repositories |
| `issue_repo` | string | Yes | `All-Hands-AI/infra` | Repository to create/update issues in |
| `issue_parent` | string | No | - | Parent GitHub issue URL for tracking |
| `issue_prefix` | string | No | `DataDog Error: ` | Prefix for issue titles |
| `max_errors` | string | No | `5` | Maximum number of errors to fetch |
| `llm_model` | string | No | `openhands/claude-sonnet-4-5-20250929` | LLM model to use |

### Outputs

- **GitHub Issues**: Created or updated with error details
- **GitHub Comments**: AI analysis posted to issues
- **Artifacts**: Debugging data and logs (retained for 7 days)

### Permissions

```yaml
permissions:
  contents: read   # Clone repositories
  issues: write    # Create/update issues and comments
```

## Customization

### For Production Use

Consider creating a separate configuration repository with:
- Scheduled runs (daily for critical, weekly for comprehensive)
- Predefined error query categories
- Repository group definitions
- Environment-specific settings

See the All-Hands-AI/infra example for a production-ready implementation.

### Adding Scheduled Runs

Add to the workflow's `on:` section:

```yaml
on:
  workflow_dispatch:
    # ... existing inputs ...

  schedule:
    # Daily at 09:00 UTC for critical errors
    - cron: '0 9 * * *'
    # Weekly on Monday at 09:00 UTC for full scan
    - cron: '0 9 * * 1'
```

Scheduled runs do not receive `workflow_dispatch` inputs, so the workflow needs fallback values for the query, repository list, and issue repository when it is triggered by `schedule`. The daily schedule above also fires on Mondays; `github.event.schedule` holds the cron expression that triggered the run and can be used to tell the two apart.

### Matrix Strategy

Run multiple queries in parallel:

```yaml
jobs:
  debug-errors:
    strategy:
      matrix:
        query:
          - "service:deploy ClientDisconnect"
          - "service:deploy http.status_code:5*"
          - "service:deploy database status:error"
      fail-fast: false
```

## Troubleshooting

### Workflow Fails to Start
- Verify all required secrets are configured
- Check that `GITHUB_TOKEN` has the necessary permissions
- Review workflow syntax with `yamllint`

### No Issues Created
- Verify the issue repository exists and is accessible
- Check that `GITHUB_TOKEN` has `issues: write` permission
- Review workflow logs for API errors

### Agent Analysis Incomplete
- Increase the workflow timeout if needed
- Check that `LLM_API_KEY` is valid and has quota
- Try a different LLM model
- Reduce the number of repositories to analyze

### Repository Clone Failures
- Verify repository names use `owner/repo` format
- Check that the token used for cloning has access to private repos (the automatic `GITHUB_TOKEN` is scoped to the repository running the workflow)
- Ensure repositories exist and are accessible

## Related Examples

- **Basic Action**: `examples/03_github_workflows/01_basic_action/` - Simple workflow example
- **PR Review**: `examples/03_github_workflows/02_pr_review/` - PR automation example
- **TODO Management**: `examples/03_github_workflows/03_todo_management/` - Automated TODO tracking

## Benefits

1. **Automated Debugging**: AI analyzes code without manual intervention
2. **Reduced MTTR**: Faster root cause identification
3. **Context-Aware**: Multi-repo analysis for complete picture
4. **No Duplicates**: Smart issue tracking prevents clutter
5. **Actionable Insights**: Clear recommendations for fixes
6. **Scalable**: Easy to add new error categories

## Learn More

- [Datadog API Documentation](https://docs.datadoghq.com/api/)
- [GitHub Actions Documentation](https://docs.github.com/en/actions)
- [OpenHands SDK Documentation](https://github.com/OpenHands/software-agent-sdk)
