| 1 | +# Extraction Modes Validation Summary |
| 2 | + |
| 3 | +This document summarizes the validation of the two extraction mechanisms in the Ax DSP system and confirms that the recent fixes behave correctly in both modes. |
| 4 | + |
| 5 | +## Two Extraction Modes |
| 6 | + |
| 7 | +### 1. Key-Value Format (`hasComplexFields = false`) |
| 8 | + |
| 9 | +**When it's used:** |
| 10 | +- Simple fields (string, number, boolean) |
| 11 | +- Arrays of simple types |
| 12 | +- Arrays of objects, when returned as a markdown list in which each item is a JSON string (see Fix 1 under "Fixes Applied") |
| 13 | + |
| 14 | +**LLM Output Format:** |
| 15 | +``` |
| 16 | +Field Name 1: value1 |
| 17 | +Field Name 2: value2 |
| 18 | +Array Field: ["item1", "item2"] |
| 19 | +``` |
| 20 | + |
| 21 | +**Extraction Mechanism:** |
| 22 | +- Uses `streamingExtractValues` to parse field prefixes (`Field Name:`) |
| 23 | +- Extracts values after each prefix |
| 24 | +- For arrays: tries JSON first, falls back to markdown list parsing |
| 25 | +- For object arrays: parses each markdown list item as JSON (new; see Fix 1 under "Fixes Applied") |
| 26 | + |
| 27 | +**Example:** |
| 28 | +```typescript |
| 29 | +const signature = f() |
| 29 | +  .input('query', f.string()) |
| 30 | +  .output('items', f.string().array()) |
| 31 | +  .build(); |
| 33 | + |
| 34 | +// LLM outputs: |
| 35 | +// Items: |
| 36 | +// - apple |
| 37 | +// - banana |
| 38 | +``` |
| 39 | + |
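The mechanism above can be illustrated with a simplified, non-streaming sketch. The helper names `parseKeyValue` and `parseArrayValue` are hypothetical and are not part of the Ax codebase; the real `streamingExtractValues` works incrementally on streamed content and is likely more involved.

```typescript
// Hypothetical helper: parse an array value, trying JSON first and
// falling back to a markdown list (one "- item" per line).
function parseArrayValue(raw: string): unknown[] {
  const text = raw.trim();
  try {
    const parsed: unknown = JSON.parse(text);
    if (Array.isArray(parsed)) return parsed;
  } catch {
    // Not JSON; fall through to markdown list parsing.
  }
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.startsWith("- "))
    .map((line) => line.slice(2).trim());
}

// Hypothetical helper: split the LLM output on known "Field Name:" prefixes
// and collect the raw text that follows each prefix.
function parseKeyValue(
  output: string,
  fieldTitles: string[],
): Record<string, string> {
  const result: Record<string, string> = {};
  let current: string | undefined;
  for (const line of output.split("\n")) {
    const title = fieldTitles.find((t) => line.startsWith(t + ":"));
    if (title !== undefined) {
      // A new field starts here; keep whatever follows the prefix.
      current = title;
      result[title] = line.slice(title.length + 1).trim();
    } else if (current !== undefined) {
      // Continuation line: append it to the field currently being read.
      result[current] = ((result[current] ?? "") + "\n" + line).trim();
    }
  }
  return result;
}

const fields = parseKeyValue(
  "Query Summary: fruit\nItems:\n- apple\n- banana",
  ["Query Summary", "Items"],
);
const items = parseArrayValue(fields["Items"] ?? "");
```

Here `items` ends up as `["apple", "banana"]` through the markdown fallback, while an input such as `["item1", "item2"]` would be handled by the JSON branch.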
| 40 | +### 2. JSON Format (`hasComplexFields = true`) |
| 41 | + |
| 42 | +**When it's used:** |
| 43 | +- When signature has object fields (non-array) |
| 44 | +- When signature has array of objects fields |
| 45 | +- When `useStructuredOutputs()` is explicitly called |
| 46 | + |
| 47 | +**LLM Output Format:** |
| 48 | +```json |
| 49 | +{ |
| 50 | +  "field1": "value1", |
| 51 | +  "field2": {"nested": "object"}, |
| 52 | +  "arrayField": [{"id": 1}, {"id": 2}] |
| 53 | +} |
| 54 | +``` |
| 55 | + |
| 56 | +**Extraction Mechanism:** |
| 57 | +- Uses `parsePartialJson` in `processResponse.ts` |
| 58 | +- Parses streaming JSON output |
| 59 | +- Validates structured outputs against schema |
| 60 | + |
| 61 | +**Example:** |
| 62 | +```typescript |
| 63 | +const signature = f() |
| 64 | +  .input('query', f.string()) |
| 65 | +  .output('result', f.object({ name: f.string() })) |
| 66 | +  .build(); |
| 67 | + |
| 68 | +// hasComplexFields() returns true |
| 69 | +// LLM outputs pure JSON |
| 70 | +``` |
| 71 | + |
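To make "parses streaming JSON output" more concrete, here is one possible way to read an incomplete JSON prefix: close any open string, array, or object and then attempt a normal parse. This is only an illustrative assumption; `closePartialJson` and `tryParsePartial` are hypothetical names, and the actual `parsePartialJson` in `processResponse.ts` may use a different strategy.

```typescript
// Hypothetical helper: append the closers needed to balance a JSON prefix.
function closePartialJson(partial: string): string {
  const closers: string[] = [];
  let inString = false;
  let escaped = false;
  for (let i = 0; i < partial.length; i++) {
    const ch = partial.charAt(i);
    if (inString) {
      // Inside a string only escapes and the closing quote matter.
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') inString = true;
    else if (ch === "{") closers.push("}");
    else if (ch === "[") closers.push("]");
    else if (ch === "}" || ch === "]") closers.pop();
  }
  // Close an unterminated string first, then brackets from innermost outward.
  return partial + (inString ? '"' : "") + closers.reverse().join("");
}

// Hypothetical helper: return the parsed value, or undefined when the prefix
// cannot be repaired this way (for example, a key without a value).
function tryParsePartial(partial: string): unknown {
  try {
    return JSON.parse(closePartialJson(partial));
  } catch {
    return undefined;
  }
}

const midString = tryParsePartial('{"field1": "val');
const midArray = tryParsePartial('{"arrayField": [{"id": 1}, {"id": 2');
const danglingKey = tryParsePartial('{"a":');
```

In this sketch `midString` and `midArray` yield usable partial objects, while `danglingKey` is `undefined` because closing the brace alone does not produce valid JSON.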
| 72 | +## Fixes Applied |
| 73 | + |
| 74 | +### 1. Enhanced Array Parsing for Object Arrays |
| 75 | + |
| 76 | +**Location:** `extract.ts` - `validateAndParseFieldValue` |
| 77 | + |
| 78 | +**What it does:** |
| 79 | +- When parsing array items that should be objects/json, tries to parse each item as JSON |
| 80 | +- Calls `extractBlock` to extract JSON from code blocks (e.g., \`\`\`json {...} \`\`\`) |
| 81 | +- Only applies to `object` and `json` types, NOT to `string` types |
| 82 | + |
| 83 | +**Code:** |
| 84 | +```typescript |
| 85 | +if ( |
| 86 | +  typeof v === 'string' && |
| 87 | +  (field.type?.name === 'object' || |
| 88 | +    (field.type?.name as string) === 'json') |
| 89 | +) { |
| 90 | +  try { |
| 91 | +    const jsonText = extractBlock(v); |
| 92 | +    v = JSON.parse(jsonText); |
| 93 | +  } catch { |
| 94 | +    // Not valid JSON: keep the original string value |
| 95 | +  } |
| 96 | +} |
| 97 | +``` |
| 98 | + |
| 99 | +**Why it works in both modes:** |
| 100 | +- **Key-value mode:** Handles markdown lists where each item is a JSON object string |
| 101 | +- **JSON mode:** Not used (JSON mode uses `parsePartialJson` instead) |
| 102 | + |
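The effect of this guard on a parsed markdown list can be sketched as follows. `parseListItem` and `ItemType` are hypothetical stand-ins for illustration, and the `extractBlock` step is omitted, so the items here are bare JSON strings rather than fenced blocks.

```typescript
type ItemType = "string" | "object" | "json";

// Hypothetical helper mirroring the guard above: only object/json items are
// JSON-parsed; string items and invalid JSON are returned unchanged.
function parseListItem(item: string, type: ItemType): unknown {
  if (type !== "object" && type !== "json") return item;
  try {
    return JSON.parse(item);
  } catch {
    return item;
  }
}

// Items as they might look after the markdown list has been split.
const listItems = ['{"id": 1}', '{"id": 2}', "not json"];

const asObjects = listItems.map((item) => parseListItem(item, "object"));
const asStrings = listItems.map((item) => parseListItem(item, "string"));
```

With the `object` type the first two items become objects and the invalid item stays a string; with the `string` type every item is left untouched.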
| 103 | +### 2. Updated `extractBlock` Regex |
| 104 | + |
| 105 | +**Location:** `extract.ts` - `extractBlock` |
| 106 | + |
| 107 | +**What it does:** |
| 108 | +- Changed regex from `/```([A-Za-z]*)\n([\s\S]*?)\n```/g` to `/```([A-Za-z]*)\s*([\s\S]*?)\s*```/g` |
| 109 | +- Now supports single-line code blocks (e.g., \`\`\`json {...} \`\`\`) |
| 110 | + |
| 111 | +**Why this was needed:** |
| 112 | +- `parseMarkdownList` enforces single-line list items |
| 113 | +- Multi-line code blocks would trigger a "mixed content detected" error |
| 114 | +- Single-line code blocks are more natural for LLM output in markdown lists |
| 115 | + |
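The difference between the two patterns can be checked directly. The sketch below rebuilds both regular expressions from the strings quoted above, without the `g` flag so that `match` returns capture groups; `blockBody` is a hypothetical helper, not the real `extractBlock`.

```typescript
// Three backticks, built indirectly so this example contains no literal fence.
const FENCE = "`".repeat(3);

// Old pattern: requires a newline after the language tag and before the close.
const oldPattern = new RegExp(
  FENCE + "([A-Za-z]*)\\n([\\s\\S]*?)\\n" + FENCE,
);
// New pattern: accepts any whitespace (including none) in those positions.
const newPattern = new RegExp(
  FENCE + "([A-Za-z]*)\\s*([\\s\\S]*?)\\s*" + FENCE,
);

// Hypothetical helper: return the code block body, or undefined if no match.
function blockBody(pattern: RegExp, text: string): string | undefined {
  const match = text.match(pattern);
  return match ? match[2] : undefined;
}

const singleLine = FENCE + 'json {"id": 1} ' + FENCE;
const multiLine = FENCE + 'json\n{"id": 2}\n' + FENCE;

const oldSingle = blockBody(oldPattern, singleLine); // undefined: no newline
const newSingle = blockBody(newPattern, singleLine); // '{"id": 1}'
const oldMulti = blockBody(oldPattern, multiLine); // '{"id": 2}'
const newMulti = blockBody(newPattern, multiLine); // '{"id": 2}'
```

The old pattern misses the single-line block entirely, while the new pattern extracts the same body from both layouts.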
| 116 | +### 3. Structured Output Features |
| 117 | + |
| 118 | +**Location:** `sig.ts`, `prompt.ts` |
| 119 | + |
| 120 | +**What it does:** |
| 121 | +- Added `useStructuredOutputs()` method to force JSON mode |
| 122 | +- Updated prompts to render examples as JSON when complex fields are enabled |
| 123 | +- Updated error correction prompts to request full JSON for complex fields |
| 124 | + |
| 125 | +**How it affects extraction:** |
| 126 | +- Sets `_forceComplexFields` flag on signature |
| 127 | +- Triggers JSON mode in `processResponse.ts` |
| 128 | +- LLM outputs pure JSON instead of key-value format |
| 129 | + |
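A minimal model of how such a flag could interact with `hasComplexFields()` is sketched below. `SketchSignature`, `OutputField`, and `extractionMode` are hypothetical names invented for this illustration; only `useStructuredOutputs`, `hasComplexFields`, and `_forceComplexFields` come from the description above, and their real implementations in `sig.ts` may differ.

```typescript
type FieldKind = "string" | "number" | "boolean" | "object";

interface OutputField {
  name: string;
  kind: FieldKind;
  isArray?: boolean;
}

// Hypothetical stand-in for a signature; only the flag logic is modeled.
class SketchSignature {
  private _forceComplexFields = false;
  private readonly outputs: OutputField[];

  constructor(outputs: OutputField[]) {
    this.outputs = outputs;
  }

  // Force JSON mode regardless of the field types.
  useStructuredOutputs(): void {
    this._forceComplexFields = true;
  }

  // True when forced, or when any output is an object or an array of objects.
  hasComplexFields(): boolean {
    return (
      this._forceComplexFields ||
      this.outputs.some((field) => field.kind === "object")
    );
  }
}

// Hypothetical dispatcher between the two extraction paths.
function extractionMode(sig: SketchSignature): "json" | "key-value" {
  return sig.hasComplexFields() ? "json" : "key-value";
}

const simple = new SketchSignature([{ name: "name", kind: "string" }]);
const modeBefore = extractionMode(simple);
simple.useStructuredOutputs();
const modeAfter = extractionMode(simple);

const withObject = new SketchSignature([{ name: "result", kind: "object" }]);
const objectMode = extractionMode(withObject);
```

In this model a signature with only simple fields starts in key-value mode and switches to JSON mode once the flag is set, while an object field selects JSON mode on its own.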
| 130 | +## Behavior Matrix |
| 131 | + |
| 132 | +| Field Type | hasComplexFields | Extraction Mode | Example LLM Output | |
| 133 | +|------------|------------------|-----------------|-------------------| |
| 134 | +| `string` | false | Key-Value | `Name: John` | |
| 135 | +| `string[]` | false | Key-Value | `Items:\n- apple\n- banana` | |
| 136 | +| `object` | **true** | JSON | `{"name": "test", "age": 30}` | |
| 137 | +| `object[]` | **true** | JSON | `[{"id": 1}, {"id": 2}]` | |
| 138 | +| Any with `useStructuredOutputs()` | **true** | JSON | `{"field1": "value"}` | |
| 139 | + |
| 140 | +## Test Coverage |
| 141 | + |
| 142 | +### `extraction_modes_validation.test.ts` |
| 143 | +- ✅ Simple strings in key-value format |
| 144 | +- ✅ Multiple fields in key-value format |
| 145 | +- ✅ Arrays with JSON in key-value format |
| 146 | +- ✅ Arrays with markdown lists in key-value format |
| 147 | +- ✅ Object arrays with JSON strings in markdown lists |
| 148 | +- ✅ Structured outputs flag behavior |
| 149 | +- ✅ Object fields trigger complex mode |
| 150 | +- ✅ Array of objects extraction |
| 151 | +- ✅ Top-level array output |
| 152 | +- ✅ Code blocks in string arrays (preserved as-is) |
| 153 | +- ✅ Backward compatibility for simple types |
| 154 | + |
| 155 | +### `verification_fixes.test.ts` |
| 156 | +- ✅ Markdown list of JSON strings for object arrays |
| 157 | +- ✅ Markdown list of JSON strings for json arrays |
| 158 | +- ✅ Handling invalid JSON gracefully |
| 159 | + |
| 160 | +### `structured_output_features.test.ts` |
| 161 | +- ✅ `useStructuredOutputs()` sets complex fields flag |
| 162 | +- ✅ Examples rendered as JSON when structured outputs enabled |
| 163 | +- ✅ Error correction requests full JSON for complex fields |
| 164 | + |
| 165 | +### `extract.test.ts` (Existing Tests) |
| 166 | +- ✅ All 19 existing tests pass |
| 167 | +- ✅ No regressions |
| 168 | + |
| 169 | +## Key Insights |
| 170 | + |
| 171 | +1. **Object fields always trigger JSON mode**: Individual object fields (not arrays) set `hasComplexFields=true`, so the LLM outputs JSON, not key-value format. |
| 172 | + |
| 173 | +2. **String arrays preserve code blocks**: For string arrays, code block syntax is preserved. JSON extraction only happens for `object` and `json` types. |
| 174 | + |
| 175 | +3. **Array of objects works in key-value mode**: While individual objects trigger JSON mode, arrays of objects can work in key-value mode through markdown lists where each item is a JSON string. |
| 176 | + |
| 177 | +4. **Two separate parsing paths**: |
| 178 | + - **Key-value**: `streamingExtractValues` → `validateAndParseFieldValue` |
| 179 | + - **JSON**: `parsePartialJson` (in `processResponse.ts`) |
| 180 | + |
| 181 | +5. **The fixes are backward compatible**: All existing tests pass, and the new behavior only affects edge cases that previously threw errors. |