Commit 54af04e

Evals refactoring. Only simple test works.
1 parent 6f33f63 commit 54af04e

15 files changed, with 1301 additions and 490 deletions

eval-server/nodejs/CLAUDE.md

Lines changed: 192 additions & 15 deletions
Original file line number | Diff line number | Diff line change
@@ -4,7 +4,12 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
44

55
## Project Overview
66

7-
bo-eval-server is a WebSocket-based evaluation server for LLM agents that implements an LLM-as-a-judge evaluation system. The server accepts connections from AI agents, sends them evaluation tasks via RPC calls, collects their responses, and uses an LLM to judge the quality of responses.
7+
bo-eval-server is a thin WebSocket and REST API server for LLM agent evaluation. The server provides:
8+
- WebSocket server for agent connections and RPC communication
9+
- REST APIs for browser automation via Chrome DevTools Protocol (CDP)
10+
- Screenshot capture and page content retrieval
11+
12+
**Evaluation orchestration and LLM-as-a-judge logic live in the separate `evals/` Python project**, which calls these APIs.
813

914
## Commands
1015

@@ -49,10 +54,11 @@ bo-eval-server is a WebSocket-based evaluation server for LLM agents that implem
4954
- Calls `Evaluate(request: String) -> String` method on connected agents
5055
- Supports `configure_llm` method for dynamic LLM provider configuration
5156

52-
**LLM Evaluator** (`src/evaluator.js`)
53-
- Integrates with OpenAI API for LLM-as-a-judge functionality
54-
- Evaluates agent responses on multiple criteria (correctness, completeness, clarity, relevance, helpfulness)
55-
- Returns structured JSON evaluation with scores and reasoning
57+
**CDP Integration** (`src/lib/EvalServer.js`)
58+
- Direct Chrome DevTools Protocol communication
59+
- Screenshot capture via `Page.captureScreenshot`
60+
- Page content access via `Runtime.evaluate`
61+
- Tab management via `Target.createTarget` / `Target.closeTarget`
5662

5763
**Logger** (`src/logger.js`)
5864
- Structured logging using Winston
@@ -62,12 +68,18 @@ bo-eval-server is a WebSocket-based evaluation server for LLM agents that implem
6268

6369
### Evaluation Flow
6470

71+
**WebSocket RPC Flow:**
6572
1. Agent connects to WebSocket server
6673
2. Agent sends "ready" signal
6774
3. Server calls agent's `Evaluate` method with a task
6875
4. Agent processes task and returns response
69-
5. Server sends response to LLM judge for evaluation
70-
6. Results are logged as JSON with scores and detailed feedback
76+
5. Response is returned to caller (evaluation orchestration happens externally in `evals/`)
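As a rough illustration of steps 3 to 5, the sketch below builds a JSON-RPC 2.0 call to the agent's `Evaluate` method and matches the reply to it by `id`. The helper names and the `{ request }` params shape are assumptions made for this example; the actual wire format is whatever the server code defines.

```javascript
// Sketch only: the params shape ({ request }) is an assumption,
// not taken from the server implementation.
let nextCallId = 1;

// Build a JSON-RPC 2.0 request for the agent's Evaluate method.
function buildEvaluateCall(request) {
  return { jsonrpc: '2.0', id: nextCallId++, method: 'Evaluate', params: { request } };
}

// Match a JSON-RPC 2.0 response to its call and unwrap the result.
function readEvaluateResult(call, response) {
  if (response.jsonrpc !== '2.0' || response.id !== call.id) {
    throw new Error('Response does not match the pending call');
  }
  if (response.error) {
    throw new Error(`Agent error ${response.error.code}: ${response.error.message}`);
  }
  return response.result;
}
```

Matching on `id` is what allows several evaluations to be in flight on one connection, since responses may arrive in any order.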
77+
78+
**REST API Flow (for screenshot/content capture):**
79+
1. External caller (e.g., Python evals runner) requests screenshot via `POST /page/screenshot`
80+
2. Server uses CDP to capture screenshot
81+
3. Returns base64-encoded image data
82+
4. External caller uses screenshots for LLM-as-a-judge visual verification
7183

7284
### Project Structure
7385

@@ -86,13 +98,29 @@ logs/ # Log files (created automatically)
8698
└── evaluations.jsonl # Evaluation results in JSON Lines format
8799
```
88100

101+
### Architecture: Separation of Concerns
102+
103+
**eval-server (Node.js)**: Thin API layer
104+
- WebSocket server for agent connections
105+
- JSON-RPC 2.0 bidirectional communication
106+
- REST APIs for CDP operations (screenshots, page content, tab management)
107+
- NO evaluation logic, NO judges, NO test orchestration
108+
109+
**evals (Python)**: Evaluation orchestration and judging
110+
- LLM judges (LLMJudge, VisionJudge) in `lib/judge.py`
111+
- Evaluation runners that call eval-server APIs
112+
- Test case definitions (YAML files in `data/`)
113+
- Result reporting and analysis
114+
115+
This separation keeps eval-server focused on infrastructure while evals/ handles business logic.
116+
89117
### Key Features
90118

91119
- **Bidirectional RPC**: Server can call methods on connected clients
92-
- **Multi-Provider LLM Support**: Support for OpenAI, Groq, OpenRouter, and LiteLLM providers
120+
- **Multi-Provider LLM Support**: Support for OpenAI, Groq, OpenRouter, and LiteLLM providers (configured by clients)
93121
- **Dynamic LLM Configuration**: Runtime configuration via `configure_llm` JSON-RPC method
94122
- **Per-Client Configuration**: Each connected client can have different LLM settings
95-
- **LLM-as-a-Judge**: Automated evaluation of agent responses using configurable LLM providers
123+
- **CDP Browser Automation**: Screenshot capture, page content access, tab management
96124
- **Concurrent Evaluations**: Support for multiple agents and parallel evaluations
97125
- **Structured Logging**: All interactions logged as JSON for analysis
98126
- **Interactive CLI**: Built-in CLI for testing and server management
@@ -277,21 +305,83 @@ Response format:
277305
}
278306
```
279307

308+
**Get Page Content**
309+
```bash
310+
POST /page/content
311+
Content-Type: application/json
312+
313+
{
314+
"clientId": "baseClientId",
315+
"tabId": "targetTabId",
316+
"format": "html" // or "text"
317+
}
318+
```
319+
320+
Retrieves the HTML or text content of a specific tab.
321+
322+
Response format:
323+
```json
324+
{
325+
"clientId": "baseClientId",
326+
"tabId": "targetTabId",
327+
"content": "<html>...</html>",
328+
"format": "html",
329+
"length": 12345,
330+
"timestamp": 1234567890
331+
}
332+
```
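A caller can catch bad input before the request leaves the process. The helper below is a hypothetical client-side sketch that mirrors the checks performed by `getPageContent()` in `src/api-server.js` (required `clientId` and `tabId`, `format` limited to `html` or `text`); the helper name and the returned options object are assumptions for illustration.

```javascript
// Hypothetical client-side helper; it mirrors the server-side validation.
const PAGE_CONTENT_FORMATS = ['html', 'text'];

// Build fetch() options for POST /page/content, rejecting invalid input early.
function buildPageContentRequest(clientId, tabId, format = 'html') {
  if (!clientId) throw new Error('Client ID is required');
  if (!tabId) throw new Error('Tab ID is required');
  if (!PAGE_CONTENT_FORMATS.includes(format)) {
    throw new Error('Format must be either "html" or "text"');
  }
  return {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ clientId, tabId, format }),
  };
}

// Usage (assumes the API listens on localhost:8081 and Node 18+ for global fetch):
// const res = await fetch('http://localhost:8081/page/content',
//   buildPageContentRequest(clientId, tabId, 'text'));
// const { content } = await res.json();
```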
333+
334+
**Capture Screenshot**
335+
```bash
336+
POST /page/screenshot
337+
Content-Type: application/json
338+
339+
{
340+
"clientId": "baseClientId",
341+
"tabId": "targetTabId",
342+
"fullPage": false
343+
}
344+
```
345+
346+
Captures a screenshot of a specific tab.
347+
348+
Response format:
349+
```json
350+
{
351+
"clientId": "baseClientId",
352+
"tabId": "targetTabId",
353+
"imageData": "data:image/png;base64,iVBORw0KG...",
354+
"format": "png",
355+
"fullPage": false,
356+
"timestamp": 1234567890
357+
}
358+
```
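The `imageData` field is a data URL, so a caller usually needs to strip the prefix and decode the base64 payload before saving the image or passing it to a vision model. A minimal sketch, assuming Node.js (for `Buffer`); the function names are illustrative and not part of the server:

```javascript
// Split a data URL such as "data:image/png;base64,...." into MIME type and bytes.
function decodeImageData(imageData) {
  const match = /^data:([^;,]+);base64,(.+)$/s.exec(imageData);
  if (!match) throw new Error('Expected a base64 data URL');
  return { mimeType: match[1], bytes: Buffer.from(match[2], 'base64') };
}

// The 8-byte signature that starts every PNG file.
const PNG_SIGNATURE = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);

// Cheap sanity check that the decoded bytes look like a PNG.
function isPng(bytes) {
  return bytes.length >= 8 && bytes.subarray(0, 8).equals(PNG_SIGNATURE);
}
```

The decoded `bytes` buffer can be written to disk as-is, for example to keep before and after screenshots alongside an evaluation result.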
359+
280360
#### Implementation Architecture
281361

282362
**Direct CDP Approach (Current)**
283363

284-
Tab management is implemented using direct Chrome DevTools Protocol (CDP) communication:
364+
Tab management and page content access are implemented using direct Chrome DevTools Protocol (CDP) communication:
285365

286366
1. Server discovers the CDP WebSocket endpoint via `http://localhost:9223/json/version`
287-
2. For each command (open/close), a new WebSocket connection is established to the CDP endpoint
367+
2. For each command, a new WebSocket connection is established to the CDP endpoint
288368
3. Commands are sent using JSON-RPC 2.0 format:
289-
- `Target.createTarget` - Opens new tab
290-
- `Target.closeTarget` - Closes existing tab
291-
4. WebSocket connection is closed after receiving the response
369+
- **Browser-level operations** (use `sendCDPCommand`):
370+
- `Target.createTarget` - Opens new tab
371+
- `Target.closeTarget` - Closes existing tab
372+
- **Tab-level operations** (use `sendCDPCommandToTarget`):
373+
- `Runtime.evaluate` - Execute JavaScript to get page content
374+
- `Page.captureScreenshot` - Capture screenshot of tab
375+
4. For tab-level operations, the server first attaches to the target, executes the command, then detaches
376+
5. WebSocket connection is closed after receiving the response
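The attach, execute, detach sequence from step 4 can be sketched as follows. This is an illustration under stated assumptions, not the code in `sendCDPCommandToTarget()`: `send` is a hypothetical transport function that assigns a message `id`, writes the message to the CDP WebSocket, and resolves with the command's `result`, and the sketch assumes flattened sessions (`flatten: true`), which the real implementation may not use.

```javascript
// Run one tab-level CDP command inside a temporary session.
// `send(message)` is a hypothetical transport; see the assumptions above.
async function runOnTarget(send, targetId, method, params = {}) {
  // Attach to the tab; CDP replies with a sessionId for this attachment.
  const { sessionId } = await send({
    method: 'Target.attachToTarget',
    params: { targetId, flatten: true },
  });
  try {
    // In flat mode the sessionId sits at the top level of the message.
    return await send({ sessionId, method, params });
  } finally {
    // Always detach, even if the command failed.
    await send({ method: 'Target.detachFromTarget', params: { sessionId } });
  }
}
```

Putting the detach in `finally` keeps a failed `Runtime.evaluate` or `Page.captureScreenshot` from leaving a session attached to the tab.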
292377

293378
Key implementation files:
294-
- `src/lib/EvalServer.js` - Contains `sendCDPCommand()`, `openTab()`, and `closeTab()` methods
379+
- `src/lib/EvalServer.js` - Contains CDP methods:
380+
- `sendCDPCommand()` - Browser-level CDP commands
381+
- `sendCDPCommandToTarget()` - Tab-level CDP commands (with attach/detach)
382+
- `openTab()`, `closeTab()` - Tab management
383+
- `getPageHTML()`, `getPageText()` - Page content access
384+
- `captureScreenshot()` - Screenshot capture
295385
- `src/api-server.js` - REST API endpoints that delegate to EvalServer methods
296386

297387
**Alternative Approach Considered**
@@ -314,6 +404,81 @@ The CDP endpoint is accessible at:
314404
- HTTP: `http://localhost:9223/json/version`
315405
- WebSocket: `ws://localhost:9223/devtools/browser/{browserId}`
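Endpoint discovery can be sketched as below. Chrome's `/json/version` response includes a `webSocketDebuggerUrl` field holding the browser-level WebSocket URL; the function names and the default base URL are assumptions for this example, and `fetch` requires Node 18 or later.

```javascript
// Pull the browser-level WebSocket URL out of a parsed /json/version response.
function browserWsUrl(versionInfo) {
  const url = versionInfo && versionInfo.webSocketDebuggerUrl;
  if (typeof url !== 'string' || !url.startsWith('ws')) {
    throw new Error('No webSocketDebuggerUrl in /json/version response');
  }
  return url;
}

// Query the CDP HTTP endpoint and return the WebSocket URL to connect to.
async function discoverCdpEndpoint(baseUrl = 'http://localhost:9223') {
  const res = await fetch(`${baseUrl}/json/version`);
  if (!res.ok) throw new Error(`CDP discovery failed: HTTP ${res.status}`);
  return browserWsUrl(await res.json());
}
```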
316406

407+
#### Usage Examples
408+
409+
**Complete workflow: Open tab, get content, take screenshot, close tab**
410+
411+
```bash
412+
# 1. Get list of clients
413+
curl -X GET http://localhost:8081/clients
414+
415+
# 2. Open a new tab
416+
curl -X POST http://localhost:8081/tabs/open \
417+
-H "Content-Type: application/json" \
418+
-d '{"clientId":"9907fd8d-92a8-4a6a-bce9-458ec8c57306","url":"https://example.com"}'
419+
420+
# Response: {"tabId":"ABC123DEF456",...}
421+
422+
# 3. Get page HTML content
423+
curl -X POST http://localhost:8081/page/content \
424+
-H "Content-Type: application/json" \
425+
-d '{"clientId":"9907fd8d-92a8-4a6a-bce9-458ec8c57306","tabId":"ABC123DEF456","format":"html"}'
426+
427+
# 4. Get page text content
428+
curl -X POST http://localhost:8081/page/content \
429+
-H "Content-Type: application/json" \
430+
-d '{"clientId":"9907fd8d-92a8-4a6a-bce9-458ec8c57306","tabId":"ABC123DEF456","format":"text"}'
431+
432+
# 5. Capture screenshot
433+
curl -X POST http://localhost:8081/page/screenshot \
434+
-H "Content-Type: application/json" \
435+
-d '{"clientId":"9907fd8d-92a8-4a6a-bce9-458ec8c57306","tabId":"ABC123DEF456","fullPage":false}'
436+
437+
# 6. Close the tab
438+
curl -X POST http://localhost:8081/tabs/close \
439+
-H "Content-Type: application/json" \
440+
-d '{"clientId":"9907fd8d-92a8-4a6a-bce9-458ec8c57306","tabId":"ABC123DEF456"}'
441+
```
442+
443+
**LLM-as-a-Judge evaluation pattern**
444+
445+
This workflow replicates the DevTools evaluation pattern using the eval-server:
446+
447+
```bash
448+
# 1. Open tab and navigate to test URL
449+
TAB_RESPONSE=$(curl -X POST http://localhost:8081/tabs/open \
450+
-H "Content-Type: application/json" \
451+
-d '{"clientId":"CLIENT_ID","url":"https://www.w3.org/WAI/ARIA/apg/patterns/button/examples/button/"}')
452+
453+
TAB_ID=$(echo "$TAB_RESPONSE" | jq -r '.tabId')
454+
455+
# 2. Capture BEFORE screenshot
456+
BEFORE_SCREENSHOT=$(curl -X POST http://localhost:8081/page/screenshot \
457+
-H "Content-Type: application/json" \
458+
-d "{\"clientId\":\"CLIENT_ID\",\"tabId\":\"$TAB_ID\",\"fullPage\":false}")
459+
460+
# 3. Execute agent action (via /v1/responses or custom endpoint)
461+
# ... agent performs action ...
462+
463+
# 4. Capture AFTER screenshot
464+
AFTER_SCREENSHOT=$(curl -X POST http://localhost:8081/page/screenshot \
465+
-H "Content-Type: application/json" \
466+
-d "{\"clientId\":\"CLIENT_ID\",\"tabId\":\"$TAB_ID\",\"fullPage\":false}")
467+
468+
# 5. Get page content for verification
469+
PAGE_CONTENT=$(curl -X POST http://localhost:8081/page/content \
470+
-H "Content-Type: application/json" \
471+
-d "{\"clientId\":\"CLIENT_ID\",\"tabId\":\"$TAB_ID\",\"format\":\"text\"}")
472+
473+
# 6. Send to LLM judge with screenshots and content
474+
# (Use OpenAI Vision API or similar with before/after screenshots)
475+
476+
# 7. Clean up
477+
curl -X POST http://localhost:8081/tabs/close \
478+
-H "Content-Type: application/json" \
479+
-d "{\"clientId\":\"CLIENT_ID\",\"tabId\":\"$TAB_ID\"}"
480+
```
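The same before/after capture can be driven from Node.js. In the sketch below, `post(path, body)` is a hypothetical helper that POSTs JSON to the eval-server and resolves with the parsed response, and `performAction(tabId)` stands in for whatever triggers the agent; both names are assumptions, and the judging step is left to the caller.

```javascript
// Open a tab, capture screenshots around an action, read the page text,
// and always close the tab afterwards.
async function captureBeforeAfter(post, clientId, url, performAction) {
  const { tabId } = await post('/tabs/open', { clientId, url });
  try {
    const before = await post('/page/screenshot', { clientId, tabId, fullPage: false });
    await performAction(tabId);
    const after = await post('/page/screenshot', { clientId, tabId, fullPage: false });
    const page = await post('/page/content', { clientId, tabId, format: 'text' });
    return { tabId, before: before.imageData, after: after.imageData, text: page.content };
  } finally {
    // Clean up even when a capture or the action throws.
    await post('/tabs/close', { clientId, tabId });
  }
}
```

Passing `post` in as a parameter keeps the function independent of the HTTP client and makes it easy to exercise against a stub.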
481+
317482
#### Current Limitations
318483

319484
**⚠️ Known Issue: WebSocket Timeout**
@@ -333,13 +498,25 @@ The CDP endpoint is correctly discovered and accessible, but WebSocket messages
333498

334499
**Workaround**: Until this issue is resolved, tab management via the API is not functional. Manual CDP testing is required to diagnose the root cause.
335500

501+
#### Features Implemented
502+
503+
- ✅ Page HTML/text content access via CDP
504+
- ✅ Screenshot capture via CDP
505+
- ✅ Direct CDP communication for tab management
506+
- ✅ Tab-level CDP command execution with attach/detach
507+
336508
#### Future Enhancements
337509

338510
- Automatic tab registration in ClientManager when DevTools connects
339511
- Tab lifecycle events (opened, closed, navigated)
340512
- Bulk tab operations
341513
- Tab metadata (title, URL, favicon)
342514
- Tab grouping and organization
515+
- Additional CDP methods:
516+
- JavaScript execution with custom expressions
517+
- DOM tree access (`DOM.getDocument`)
518+
- MHTML snapshots (`Page.captureSnapshot`)
519+
- PDF generation (`Page.printToPDF`)
343520

344521
### Configuration
345522

eval-server/nodejs/examples/with-http-wrapper.js

Lines changed: 3 additions & 3 deletions
@@ -18,7 +18,7 @@ const evalServer = new EvalServer({
1818

1919
console.log('🔧 Creating HTTP wrapper...');
2020
const httpWrapper = new HTTPWrapper(evalServer, {
21-
port: 8080,
21+
port: 8083,
2222
host: '0.0.0.0'
2323
});
2424

@@ -29,11 +29,11 @@ console.log('✅ EvalServer started on ws://127.0.0.1:8082');
2929

3030
console.log('🔧 Starting HTTP wrapper...');
3131
await httpWrapper.start();
32-
console.log('✅ HTTP API started on http://127.0.0.1:8080');
32+
console.log('✅ HTTP API started on http://127.0.0.1:8083');
3333

3434
console.log('⏳ Waiting for DevTools client to connect...');
3535
console.log(' WebSocket URL: ws://127.0.0.1:8082');
36-
console.log(' HTTP API URL: http://127.0.0.1:8080');
36+
console.log(' HTTP API URL: http://127.0.0.1:8083');
3737
console.log(' Auth: Disabled (automated mode)');
3838

3939
// Add periodic status check

eval-server/nodejs/src/api-server.js

Lines changed: 77 additions & 0 deletions
@@ -142,6 +142,22 @@ class APIServer {
142142
result = await this.handleResponsesRequest(JSON.parse(body));
143143
break;
144144

145+
case '/page/content':
146+
if (method !== 'POST') {
147+
this.sendError(res, 405, 'Method not allowed');
148+
return;
149+
}
150+
result = await this.getPageContent(JSON.parse(body));
151+
break;
152+
153+
case '/page/screenshot':
154+
if (method !== 'POST') {
155+
this.sendError(res, 405, 'Method not allowed');
156+
return;
157+
}
158+
result = await this.getScreenshot(JSON.parse(body));
159+
break;
160+
145161
default:
146162
this.sendError(res, 404, 'Not found');
147163
return;
@@ -349,6 +365,67 @@ class APIServer {
349365
};
350366
}
351367

368+
async getPageContent(payload) {
369+
const { clientId, tabId, format = 'html' } = payload;
370+
371+
if (!clientId) {
372+
throw new Error('Client ID is required');
373+
}
374+
375+
if (!tabId) {
376+
throw new Error('Tab ID is required');
377+
}
378+
379+
if (!['html', 'text'].includes(format)) {
380+
throw new Error('Format must be either "html" or "text"');
381+
}
382+
383+
const baseClientId = clientId.split(':')[0];
384+
385+
logger.info('Getting page content', { baseClientId, tabId, format });
386+
387+
// Call appropriate method based on format
388+
const result = format === 'html'
389+
? await this.evaluationServer.getPageHTML(tabId)
390+
: await this.evaluationServer.getPageText(tabId);
391+
392+
return {
393+
clientId: baseClientId,
394+
tabId: result.tabId,
395+
content: result.content,
396+
format: result.format,
397+
length: result.length,
398+
timestamp: Date.now()
399+
};
400+
}
401+
402+
async getScreenshot(payload) {
403+
const { clientId, tabId, fullPage = false } = payload;
404+
405+
if (!clientId) {
406+
throw new Error('Client ID is required');
407+
}
408+
409+
if (!tabId) {
410+
throw new Error('Tab ID is required');
411+
}
412+
413+
const baseClientId = clientId.split(':')[0];
414+
415+
logger.info('Capturing screenshot', { baseClientId, tabId, fullPage });
416+
417+
const result = await this.evaluationServer.captureScreenshot(tabId, { fullPage });
418+
419+
return {
420+
clientId: baseClientId,
421+
tabId: result.tabId,
422+
imageData: result.imageData,
423+
format: result.format,
424+
fullPage: result.fullPage,
425+
timestamp: Date.now()
426+
};
427+
}
428+
352429
/**
353430
* Handle OpenAI Responses API compatible requests with nested model format
354431
*/
