@@ -4,7 +4,12 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co

 ## Project Overview

-bo-eval-server is a WebSocket-based evaluation server for LLM agents that implements an LLM-as-a-judge evaluation system. The server accepts connections from AI agents, sends them evaluation tasks via RPC calls, collects their responses, and uses an LLM to judge the quality of responses.
+bo-eval-server is a thin WebSocket and REST API server for LLM agent evaluation. The server provides:
+- WebSocket server for agent connections and RPC communication
+- REST APIs for browser automation via Chrome DevTools Protocol (CDP)
+- Screenshot capture and page content retrieval
+
+**Evaluation orchestration and LLM-as-a-judge logic lives in the separate `evals/` Python project**, which calls these APIs.

 ## Commands

@@ -49,10 +54,11 @@ bo-eval-server is a WebSocket-based evaluation server for LLM agents that implem
 - Calls `Evaluate(request: String) -> String` method on connected agents
 - Supports `configure_llm` method for dynamic LLM provider configuration

-**LLM Evaluator** (`src/evaluator.js`)
-- Integrates with OpenAI API for LLM-as-a-judge functionality
-- Evaluates agent responses on multiple criteria (correctness, completeness, clarity, relevance, helpfulness)
-- Returns structured JSON evaluation with scores and reasoning
+**CDP Integration** (`src/lib/EvalServer.js`)
+- Direct Chrome DevTools Protocol communication
+- Screenshot capture via `Page.captureScreenshot`
+- Page content access via `Runtime.evaluate`
+- Tab management via `Target.createTarget` / `Target.closeTarget`

 **Logger** (`src/logger.js`)
 - Structured logging using Winston
@@ -62,12 +68,18 @@ bo-eval-server is a WebSocket-based evaluation server for LLM agents that implem

 ### Evaluation Flow

+**WebSocket RPC Flow:**
 1. Agent connects to WebSocket server
 2. Agent sends "ready" signal
 3. Server calls agent's `Evaluate` method with a task
 4. Agent processes task and returns response
-5. Server sends response to LLM judge for evaluation
-6. Results are logged as JSON with scores and detailed feedback
+5. Response is returned to caller (evaluation orchestration happens externally in `evals/`)
+
+**REST API Flow (for screenshot/content capture):**
+1. External caller (e.g., Python evals runner) requests screenshot via `POST /page/screenshot`
+2. Server uses CDP to capture screenshot
+3. Returns base64-encoded image data
+4. External caller uses screenshots for LLM-as-a-judge visual verification

 ### Project Structure

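The REST API flow added in this hunk can be sketched from the caller's side. This is a minimal Python sketch, not the `evals/` implementation: the helper names are hypothetical, the base URL `http://localhost:8081` is taken from the curl examples later in this diff, and the handling of a `data:` URL prefix is an assumption about how `imageData` might be encoded.

```python
import base64
import json
import urllib.request

# Assumption: the API port used by the curl examples in this diff.
BASE_URL = "http://localhost:8081"


def build_screenshot_request(client_id, tab_id, full_page=False, base_url=BASE_URL):
    """Build (but do not send) the POST /page/screenshot request."""
    body = json.dumps(
        {"clientId": client_id, "tabId": tab_id, "fullPage": full_page}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/page/screenshot",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def decode_image(response):
    """Turn the base64 `imageData` field of a response into raw bytes."""
    data = response["imageData"]
    # Assumption: tolerate an optional "data:image/png;base64," prefix.
    if data.startswith("data:"):
        data = data.split(",", 1)[1]
    return base64.b64decode(data)


def capture_screenshot(client_id, tab_id, full_page=False):
    """Send the request to a running server and return the image bytes."""
    request = build_screenshot_request(client_id, tab_id, full_page)
    with urllib.request.urlopen(request, timeout=30) as resp:
        return decode_image(json.load(resp))
```

Keeping request construction and response decoding separate from the network call lets both be checked without a running server; only `capture_screenshot` needs one.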
@@ -86,13 +98,29 @@ logs/ # Log files (created automatically)
 └── evaluations.jsonl # Evaluation results in JSON Lines format
 ```

+### Architecture: Separation of Concerns
+
+**eval-server (Node.js)**: Thin API layer
+- WebSocket server for agent connections
+- JSON-RPC 2.0 bidirectional communication
+- REST APIs for CDP operations (screenshots, page content, tab management)
+- NO evaluation logic, NO judges, NO test orchestration
+
+**evals (Python)**: Evaluation orchestration and judging
+- LLM judges (LLMJudge, VisionJudge) in `lib/judge.py`
+- Evaluation runners that call eval-server APIs
+- Test case definitions (YAML files in `data/`)
+- Result reporting and analysis
+
+This separation keeps eval-server focused on infrastructure while evals/ handles business logic.
+
 ### Key Features

 - **Bidirectional RPC**: Server can call methods on connected clients
-- **Multi-Provider LLM Support**: Support for OpenAI, Groq, OpenRouter, and LiteLLM providers
+- **Multi-Provider LLM Support**: Support for OpenAI, Groq, OpenRouter, and LiteLLM providers (configured by clients)
 - **Dynamic LLM Configuration**: Runtime configuration via `configure_llm` JSON-RPC method
 - **Per-Client Configuration**: Each connected client can have different LLM settings
-- **LLM-as-a-Judge**: Automated evaluation of agent responses using configurable LLM providers
+- **CDP Browser Automation**: Screenshot capture, page content access, tab management
 - **Concurrent Evaluations**: Support for multiple agents and parallel evaluations
 - **Structured Logging**: All interactions logged as JSON for analysis
 - **Interactive CLI**: Built-in CLI for testing and server management
@@ -277,21 +305,83 @@ Response format:
 }
 ```

+**Get Page Content**
+```bash
+POST /page/content
+Content-Type: application/json
+
+{
+  "clientId": "baseClientId",
+  "tabId": "targetTabId",
+  "format": "html" // or "text"
+}
+```
+
+Retrieves the HTML or text content of a specific tab.
+
+Response format:
+```json
+{
+  "clientId": "baseClientId",
+  "tabId": "targetTabId",
+  "content": "<html>...</html>",
+  "format": "html",
+  "length": 12345,
+  "timestamp": 1234567890
+}
+```
+
+**Capture Screenshot**
+```bash
+POST /page/screenshot
+Content-Type: application/json
+
+{
+  "clientId": "baseClientId",
+  "tabId": "targetTabId",
+  "fullPage": false
+}
+```
+
+Captures a screenshot of a specific tab.
+
+Response format:
+```json
+{
+  "clientId": "baseClientId",
+  "tabId": "targetTabId",
+  "imageData": "...",
+  "format": "png",
+  "fullPage": false,
+  "timestamp": 1234567890
+}
+```
+
 #### Implementation Architecture

 **Direct CDP Approach (Current)**

-Tab management is implemented using direct Chrome DevTools Protocol (CDP) communication:
+Tab management and page content access are implemented using direct Chrome DevTools Protocol (CDP) communication:

 1. Server discovers the CDP WebSocket endpoint via `http://localhost:9223/json/version`
-2. For each command (open/close), a new WebSocket connection is established to the CDP endpoint
+2. For each command, a new WebSocket connection is established to the CDP endpoint
 3. Commands are sent using JSON-RPC 2.0 format:
-   - `Target.createTarget` - Opens new tab
-   - `Target.closeTarget` - Closes existing tab
-4. WebSocket connection is closed after receiving the response
+   - **Browser-level operations** (use `sendCDPCommand`):
+     - `Target.createTarget` - Opens new tab
+     - `Target.closeTarget` - Closes existing tab
+   - **Tab-level operations** (use `sendCDPCommandToTarget`):
+     - `Runtime.evaluate` - Execute JavaScript to get page content
+     - `Page.captureScreenshot` - Capture screenshot of tab
+4. For tab-level operations, the server first attaches to the target, executes the command, then detaches
+5. WebSocket connection is closed after receiving the response

 Key implementation files:
-- `src/lib/EvalServer.js` - Contains `sendCDPCommand()`, `openTab()`, and `closeTab()` methods
+- `src/lib/EvalServer.js` - Contains CDP methods:
+  - `sendCDPCommand()` - Browser-level CDP commands
+  - `sendCDPCommandToTarget()` - Tab-level CDP commands (with attach/detach)
+  - `openTab()`, `closeTab()` - Tab management
+  - `getPageHTML()`, `getPageText()` - Page content access
+  - `captureScreenshot()` - Screenshot capture
 - `src/api-server.js` - REST API endpoints that delegate to EvalServer methods

 **Alternative Approach Considered**
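The attach, execute, detach sequence described for tab-level operations can be illustrated with the messages involved. This is a sketch under assumptions, not the code in `sendCDPCommandToTarget()`: it assumes CDP's flat session mode (`Target.attachToTarget` with `flatten: true`, then commands carrying a top-level `sessionId`), and every Python name is hypothetical. It only builds the messages; sending them over the CDP WebSocket is left out.

```python
import itertools
import json

# Each CDP message needs a unique integer id so responses can be matched.
_ids = itertools.count(1)


def cdp_message(method, params=None, session_id=None):
    """Build one CDP command; session_id routes it to an attached tab."""
    message = {"id": next(_ids), "method": method, "params": params or {}}
    if session_id is not None:
        message["sessionId"] = session_id
    return message


def attach_message(target_id):
    """Step 1: attach to the tab (assumes flat session mode)."""
    return cdp_message(
        "Target.attachToTarget", {"targetId": target_id, "flatten": True}
    )


def session_id_from(attach_response):
    """Read the session id out of the attach response."""
    return attach_response["result"]["sessionId"]


def screenshot_message(session_id):
    """Step 2: run a tab-level command inside the attached session."""
    return cdp_message("Page.captureScreenshot", {"format": "png"}, session_id)


def detach_message(session_id):
    """Step 3: detach from the tab once the command has returned."""
    return cdp_message("Target.detachFromTarget", {"sessionId": session_id})


if __name__ == "__main__":
    attach = attach_message("ABC123DEF456")
    # Hypothetical response; a real one comes back over the WebSocket.
    fake_response = {"id": attach["id"], "result": {"sessionId": "SESSION-1"}}
    sid = session_id_from(fake_response)
    for msg in (attach, screenshot_message(sid), detach_message(sid)):
        print(json.dumps(msg))
```

Browser-level commands such as `Target.createTarget` need no session, which is why they can skip the attach and detach steps.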
@@ -314,6 +404,81 @@ The CDP endpoint is accessible at:
 - HTTP: `http://localhost:9223/json/version`
 - WebSocket: `ws://localhost:9223/devtools/browser/{browserId}`

+#### Usage Examples
+
+**Complete workflow: Open tab, get content, take screenshot, close tab**
+
+```bash
+# 1. Get list of clients
+curl -X GET http://localhost:8081/clients
+
+# 2. Open a new tab
+curl -X POST http://localhost:8081/tabs/open \
+  -H "Content-Type: application/json" \
+  -d '{"clientId":"9907fd8d-92a8-4a6a-bce9-458ec8c57306","url":"https://example.com"}'
+
+# Response: {"tabId":"ABC123DEF456",...}
+
+# 3. Get page HTML content
+curl -X POST http://localhost:8081/page/content \
+  -H "Content-Type: application/json" \
+  -d '{"clientId":"9907fd8d-92a8-4a6a-bce9-458ec8c57306","tabId":"ABC123DEF456","format":"html"}'
+
+# 4. Get page text content
+curl -X POST http://localhost:8081/page/content \
+  -H "Content-Type: application/json" \
+  -d '{"clientId":"9907fd8d-92a8-4a6a-bce9-458ec8c57306","tabId":"ABC123DEF456","format":"text"}'
+
+# 5. Capture screenshot
+curl -X POST http://localhost:8081/page/screenshot \
+  -H "Content-Type: application/json" \
+  -d '{"clientId":"9907fd8d-92a8-4a6a-bce9-458ec8c57306","tabId":"ABC123DEF456","fullPage":false}'
+
+# 6. Close the tab
+curl -X POST http://localhost:8081/tabs/close \
+  -H "Content-Type: application/json" \
+  -d '{"clientId":"9907fd8d-92a8-4a6a-bce9-458ec8c57306","tabId":"ABC123DEF456"}'
+```
+
+**LLM-as-a-Judge evaluation pattern**
+
+This workflow replicates the DevTools evaluation pattern using the eval-server:
+
+```bash
+# 1. Open tab and navigate to test URL
+TAB_RESPONSE=$(curl -X POST http://localhost:8081/tabs/open \
+  -H "Content-Type: application/json" \
+  -d '{"clientId":"CLIENT_ID","url":"https://www.w3.org/WAI/ARIA/apg/patterns/button/examples/button/"}')
+
+TAB_ID=$(echo $TAB_RESPONSE | jq -r '.tabId')
+
+# 2. Capture BEFORE screenshot
+BEFORE_SCREENSHOT=$(curl -X POST http://localhost:8081/page/screenshot \
+  -H "Content-Type: application/json" \
+  -d "{\"clientId\":\"CLIENT_ID\",\"tabId\":\"$TAB_ID\",\"fullPage\":false}")
+
+# 3. Execute agent action (via /v1/responses or custom endpoint)
+# ... agent performs action ...
+
+# 4. Capture AFTER screenshot
+AFTER_SCREENSHOT=$(curl -X POST http://localhost:8081/page/screenshot \
+  -H "Content-Type: application/json" \
+  -d "{\"clientId\":\"CLIENT_ID\",\"tabId\":\"$TAB_ID\",\"fullPage\":false}")
+
+# 5. Get page content for verification
+PAGE_CONTENT=$(curl -X POST http://localhost:8081/page/content \
+  -H "Content-Type: application/json" \
+  -d "{\"clientId\":\"CLIENT_ID\",\"tabId\":\"$TAB_ID\",\"format\":\"text\"}")
+
+# 6. Send to LLM judge with screenshots and content
+# (Use OpenAI Vision API or similar with before/after screenshots)
+
+# 7. Clean up
+curl -X POST http://localhost:8081/tabs/close \
+  -H "Content-Type: application/json" \
+  -d "{\"clientId\":\"CLIENT_ID\",\"tabId\":\"$TAB_ID\"}"
+```
+
 #### Current Limitations

 **⚠️ Known Issue: WebSocket Timeout**
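Steps 2 through 6 of the LLM-as-a-Judge pattern in this hunk end with handing screenshots and page text to a judge. The following Python sketch shows one way a caller might assemble that input from the documented response shapes. It is an illustration only: `build_judge_input` and `screenshots_differ` are hypothetical helpers, not part of `lib/judge.py`, and the optional `data:` URL prefix on `imageData` is an assumption.

```python
import base64
import hashlib


def _raw_image(screenshot_response):
    """Decode the base64 `imageData` field of a /page/screenshot response."""
    data = screenshot_response["imageData"]
    # Assumption: tolerate an optional "data:image/png;base64," prefix.
    if data.startswith("data:"):
        data = data.split(",", 1)[1]
    return base64.b64decode(data)


def screenshots_differ(before, after):
    """Cheap pre-check: did the agent's action change the page visually?"""
    before_hash = hashlib.sha256(_raw_image(before)).digest()
    after_hash = hashlib.sha256(_raw_image(after)).digest()
    return before_hash != after_hash


def build_judge_input(task, before, after, content):
    """Bundle everything a vision judge would need for one evaluation."""
    return {
        "task": task,
        "page_text": content["content"],
        "before_png": _raw_image(before),
        "after_png": _raw_image(after),
        "visual_change": screenshots_differ(before, after),
    }
```

A byte-level hash comparison only detects whether anything changed at all; judging whether the change is the intended one remains the job of the LLM judge.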
@@ -333,13 +498,25 @@ The CDP endpoint is correctly discovered and accessible, but WebSocket messages

 **Workaround**: Until this issue is resolved, tab management via the API is not functional. Manual CDP testing is required to diagnose the root cause.

+#### Features Implemented
+
+- ✅ Page HTML/text content access via CDP
+- ✅ Screenshot capture via CDP
+- ✅ Direct CDP communication for tab management
+- ✅ Tab-level CDP command execution with attach/detach
+
 #### Future Enhancements

 - Automatic tab registration in ClientManager when DevTools connects
 - Tab lifecycle events (opened, closed, navigated)
 - Bulk tab operations
 - Tab metadata (title, URL, favicon)
 - Tab grouping and organization
+- Additional CDP methods:
+  - JavaScript execution with custom expressions
+  - DOM tree access (`DOM.getDocument`)
+  - MHTML snapshots (`Page.captureSnapshot`)
+  - PDF generation (`Page.printToPDF`)

 ### Configuration
