| 1 | +# CLAUDE.md |
| 2 | + |
| 3 | +This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. |
| 4 | + |
| 5 | +## Project Overview |
| 6 | + |
| 7 | +validate-xml is a high-performance XML schema validator written in Rust. It validates thousands of XML files against XSD schemas using concurrent processing and two-tier caching (memory + disk). It is built on libxml2 FFI bindings and uses async I/O throughout. |
| 8 | + |
| 9 | +**Key Performance**: Validates 20,000 files in ~2 seconds (cached) or ~30 seconds (first run with schema downloads). |
| 10 | + |
| 11 | +## Common Commands |
| 12 | + |
| 13 | +### Building and Testing |
| 14 | + |
| 15 | +```bash |
| 16 | +# Development build |
| 17 | +cargo build |
| 18 | + |
| 19 | +# Release build (optimized) |
| 20 | +cargo build --release |
| 21 | + |
| 22 | +# Run all tests (deterministic, no network calls) |
| 23 | +cargo test |
| 24 | + |
| 25 | +# Run a specific test |
| 26 | +cargo test test_name |
| 27 | + |
| 28 | +# Run tests with output visible |
| 29 | +cargo test -- --nocapture |
| 30 | + |
| 31 | +# Run only library tests (fastest) |
| 32 | +cargo test --lib |
| 33 | + |
| 34 | +# Run ignored network tests (requires internet) |
| 35 | +cargo test -- --ignored |
| 36 | + |
| 37 | +# Run a single test file |
| 38 | +cargo test --test http_client_test |
| 39 | +``` |
| 40 | + |
| 41 | +### Running the Binary |
| 42 | + |
| 43 | +```bash |
| 44 | +# Run with development build |
| 45 | +cargo run -- /path/to/xml/files |
| 46 | + |
| 47 | +# Run with release build (much faster) |
| 48 | +cargo run --release -- /path/to/xml/files |
| 49 | + |
| 50 | +# With options |
| 51 | +cargo run --release -- --verbose --extensions xml,cmdi /path/to/files |
| 52 | + |
| 53 | +# With debug logging |
| 54 | +RUST_LOG=debug cargo run -- /path/to/files |
| 55 | +``` |
| 56 | + |
| 57 | +### Code Quality |
| 58 | + |
| 59 | +```bash |
| 60 | +# Format code |
| 61 | +cargo fmt |
| 62 | + |
| 63 | +# Check formatting without changes |
| 64 | +cargo fmt --check |
| 65 | + |
| 66 | +# Run clippy linter |
| 67 | +cargo clippy |
| 68 | + |
| 69 | +# Fix clippy warnings automatically |
| 70 | +cargo clippy --fix |
| 71 | +``` |
| 72 | + |
| 73 | +## Architecture |
| 74 | + |
| 75 | +### Core Components |
| 76 | + |
| 77 | +The codebase follows a modular async-first architecture with clear separation of concerns: |
| 78 | + |
| 79 | +1. **File Discovery** (`file_discovery.rs`) |
| 80 | + - Recursively traverses directories to find XML files |
| 81 | + - Filters by extension using glob patterns |
| 82 | + - Single-threaded sequential operation |
| 83 | + |
| 84 | +2. **Schema Loading** (`schema_loader.rs`) |
| 85 | + - Extracts schema URLs from XML using regex (xsi:schemaLocation, xsi:noNamespaceSchemaLocation) |
| 86 | + - Downloads remote schemas via async HTTP client |
| 87 | + - Validates schema content before caching |
| 88 | + - Integrates with two-tier cache system |
| 89 | + |
| 90 | +3. **Two-Tier Caching** (`cache.rs`) |
| 91 | + - **L1 (Memory)**: moka cache for in-run reuse (microsecond lookups) |
| 92 | + - **L2 (Disk)**: cacache for cross-run persistence (millisecond lookups) |
| 93 | + - Thread-safe via Arc wrapping |
| 94 | + - Configurable TTL and size limits |
| 95 | + |
| 96 | +4. **Validation Engine** (`validator.rs`) |
| 97 | + - **Hybrid architecture**: Async I/O orchestration + sync CPU-bound validation |
| 98 | + - Spawns concurrent async tasks (bounded by semaphore) |
| 99 | + - Each task: load XML → fetch schema → validate via libxml2 (synchronous, thread-safe) |
| 100 | + - Collects results and statistics |
| 101 | + - Default concurrency = CPU core count |
| 102 | + |
| 103 | +5. **libxml2 FFI** (`libxml2.rs`) |
| 104 | + - Safe Rust wrappers around unsafe C FFI calls |
| 105 | + - Memory management via RAII patterns |
| 106 | + - Schema parsing and XML validation |
| 107 | + - **CRITICAL Thread Safety**: |
| 108 | + - Schema parsing is NOT thread-safe (serialized via cache) |
| 109 | + - Validation IS thread-safe (parallel execution, no global locks) |
| 110 | + |
| 111 | +6. **Error Handling** (`error.rs`, `error_reporter.rs`) |
| 112 | + - Structured error types using thiserror |
| 113 | + - Context-rich error messages with recovery hints |
| 114 | + - Line/column precision for validation errors |
| 115 | + - Both human-readable and JSON output formats |
| 116 | + |
| 117 | +7. **Configuration** (`config.rs`) |
| 118 | + - Environment variable support via `EnvProvider` trait pattern |
| 119 | + - File-based config (TOML/JSON) |
| 120 | + - CLI argument merging (CLI > env > file > defaults) |
| 121 | + - **IMPORTANT**: Uses dependency injection for testability |
| 122 | + |
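The configuration precedence described above (CLI > env > file > defaults) can be sketched with `Option::or`. This is a minimal illustration, not the project's `config.rs`: the `Layer` struct, its `threads` field, and `resolve_threads` are hypothetical names chosen for the example.

```rust
/// Hypothetical: one configuration source; `None` means "not set here".
#[derive(Default, Clone, Copy)]
struct Layer {
    threads: Option<usize>,
}

/// Resolve one setting with the precedence CLI > env > file > default.
fn resolve_threads(cli: Layer, env: Layer, file: Layer, default: usize) -> usize {
    cli.threads
        .or(env.threads) // used only when the CLI did not set a value
        .or(file.threads) // used only when neither CLI nor env set a value
        .unwrap_or(default)
}
```

Each source only fills in a value when every higher-priority source left it unset, so adding a new source means adding one more `.or(...)` at the right position.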
| 123 | +### Data Flow |
| 124 | + |
| 125 | +``` |
| 126 | +CLI Args → Config Merge → File Discovery → Schema Extraction |
| 127 | + ↓ |
| 128 | + Schema Cache Check |
| 129 | + (L1 → L2 → HTTP) |
| 130 | + ↓ |
| 131 | + Concurrent Validation Tasks |
| 132 | + (bounded by semaphore) |
| 133 | + ↓ |
| 134 | + Error Aggregation → Output |
| 135 | + (Text or JSON format) |
| 136 | +``` |
| 137 | + |
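The `L1 → L2 → HTTP` lookup order in the diagram can be sketched as follows. This is a simplified stand-in, not the code in `cache.rs`: plain `HashMap`s replace moka and cacache, a closure replaces the HTTP client, and `TieredCache`, `Source`, and `get_or_fetch` are hypothetical names. Promoting a hit from a slower tier into the faster one is also an assumption.

```rust
use std::collections::HashMap;

/// Where a schema was found (hypothetical enum for illustration).
#[derive(Debug, PartialEq)]
enum Source {
    Memory,
    Disk,
    Network,
}

/// Stand-in for the two tiers: one `HashMap` per tier instead of moka/cacache.
struct TieredCache {
    l1: HashMap<String, String>,
    l2: HashMap<String, String>,
}

impl TieredCache {
    /// Look up `url` in L1, then L2, then fall back to `fetch`.
    /// A hit in a slower tier is copied into the faster tiers.
    fn get_or_fetch<F>(&mut self, url: &str, fetch: F) -> (String, Source)
    where
        F: FnOnce(&str) -> String,
    {
        if let Some(body) = self.l1.get(url) {
            return (body.clone(), Source::Memory);
        }
        if let Some(body) = self.l2.get(url).cloned() {
            self.l1.insert(url.to_string(), body.clone());
            return (body, Source::Disk);
        }
        let body = fetch(url);
        self.l2.insert(url.to_string(), body.clone());
        self.l1.insert(url.to_string(), body.clone());
        (body, Source::Network)
    }
}
```

The first request for a URL reaches `fetch`, a repeated request within the same run is served from `l1`, and a request after `l1` has been emptied (as at the start of a new run) is served from `l2`.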
| 138 | +### Key Design Patterns |
| 139 | + |
| 140 | +1. **Async-First**: All I/O operations use tokio async runtime |
| 141 | +2. **Dependency Injection**: Config system uses `EnvProvider` trait for testability |
| 142 | +3. **Two-Tier Caching**: Memory (fast) + Disk (persistent) for optimal performance |
| 143 | +4. **Bounded Concurrency**: Semaphore limits prevent resource exhaustion |
| 144 | +5. **RAII for FFI**: Proper cleanup of libxml2 resources via Drop trait |
| 145 | + |
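To make the bounded-concurrency pattern concrete, here is a sketch that uses only the standard library. It is not how `validator.rs` is written: the project uses tokio tasks and an async semaphore, while this example swaps in OS threads and a hand-rolled counting semaphore. `Semaphore` and `run_bounded` are hypothetical names.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Minimal counting semaphore built from std primitives.
struct Semaphore {
    permits: Mutex<usize>,
    available: Condvar,
}

impl Semaphore {
    fn new(permits: usize) -> Self {
        Self { permits: Mutex::new(permits), available: Condvar::new() }
    }

    /// Block until a permit is free, then take it.
    fn acquire(&self) {
        let mut permits = self.permits.lock().unwrap();
        while *permits == 0 {
            permits = self.available.wait(permits).unwrap();
        }
        *permits -= 1;
    }

    /// Return a permit and wake one waiter.
    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.available.notify_one();
    }
}

/// Run `jobs` tasks with at most `limit` inside the guarded section at once.
/// Returns the peak number of tasks observed running concurrently.
fn run_bounded(jobs: usize, limit: usize) -> usize {
    let sem = Arc::new(Semaphore::new(limit));
    let running = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..jobs)
        .map(|_| {
            let (sem, running, peak) =
                (Arc::clone(&sem), Arc::clone(&running), Arc::clone(&peak));
            thread::spawn(move || {
                sem.acquire();
                let now = running.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                thread::yield_now(); // stand-in for the validation work
                running.fetch_sub(1, Ordering::SeqCst);
                sem.release();
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap(); // wait for completion instead of sleeping
    }
    peak.load(Ordering::SeqCst)
}
```

However many jobs are queued, the observed peak never exceeds `limit`, which is the property that keeps memory and file-handle usage predictable.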
| 146 | +## Testing Philosophy |
| 147 | + |
| 148 | +### Test Structure |
| 149 | + |
| 150 | +The project has **214+ passing tests** organized as: |
| 151 | +- **115 unit tests** in `src/` modules (fast, no I/O) |
| 152 | +- **99 integration tests** in `tests/` (slower, includes I/O simulation) |
| 153 | +- **24 ignored tests** (network-dependent, run explicitly with `--ignored`) |
| 154 | + |
| 155 | +### Critical Testing Rules |
| 156 | + |
| 157 | +1. **No Unsafe Code in Tests**: All environment variable manipulation must use `MockEnvProvider` pattern (see `src/config.rs` tests) |
| 158 | + |
| 159 | +2. **No Real Network Calls**: Tests making HTTP requests to external services (e.g., httpbin.org) must be marked `#[ignore]` |
| 160 | + ```rust |
| 161 | + #[tokio::test] |
| 162 | + #[ignore] // Requires internet connectivity - run with: cargo test -- --ignored |
| 163 | + async fn test_network_operation() { /* ... */ } |
| 164 | + ``` |
| 165 | + |
| 166 | +3. **Deterministic Tests Only**: Never use: |
| 167 | + - `tokio::time::sleep()` without proper synchronization |
| 168 | + - `tokio::spawn()` without waiting for completion |
| 169 | + - Real system time for timing assertions |
| 170 | + |
| 171 | +4. **Race Condition Prevention**: When testing concurrent code, use proper synchronization: |
| 172 | + ```rust |
| 173 | + // BAD: Race condition |
| 174 | + tokio::spawn(async move { /* ... */ }); |
| 175 | + tokio::time::sleep(Duration::from_millis(50)).await; // Hope it finishes |
| 176 | + |
| 177 | + // GOOD: Proper synchronization |
| 178 | + let handle = tokio::spawn(async move { /* ... */ }); |
| 179 | + handle.await.unwrap(); // Wait for completion |
| 180 | + ``` |
| 181 | + |
| 182 | +### Running Flaky/Network Tests |
| 183 | + |
| 184 | +Network tests are ignored by default to ensure CI reliability: |
| 185 | +```bash |
| 186 | +# Run only network tests |
| 187 | +cargo test -- --ignored |
| 188 | + |
| 189 | +# Run all tests including network tests |
| 190 | +cargo test -- --include-ignored |
| 191 | +``` |
| 192 | + |
| 193 | +## Environment Variables |
| 194 | + |
| 195 | +The config system supports environment variable overrides: |
| 196 | + |
| 197 | +```bash |
| 198 | +# Cache configuration |
| 199 | +export VALIDATE_XML_CACHE_DIR=/custom/cache |
| 200 | +export VALIDATE_XML_CACHE_TTL=48 |
| 201 | + |
| 202 | +# Validation settings |
| 203 | +export VALIDATE_XML_THREADS=4 |
| 204 | +export VALIDATE_XML_TIMEOUT=120 |
| 205 | + |
| 206 | +# Output settings |
| 207 | +export VALIDATE_XML_VERBOSE=true |
| 208 | +export VALIDATE_XML_FORMAT=json |
| 209 | +``` |
| 210 | + |
| 211 | +## libxml2 FFI Critical Notes |
| 212 | + |
| 213 | +When working with `libxml2.rs`: |
| 214 | + |
| 215 | +1. **Memory Safety**: All pointers must be checked for null before dereferencing |
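| 216 | +2. **Cleanup**: Parsed schemas must be freed via `xmlSchemaFree` in Drop implementations; libxml2 has separate free functions for parser and validation contexts (`xmlSchemaFreeParserCtxt`, `xmlSchemaFreeValidCtxt`) |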
| 217 | +3. **Thread Safety** (see ARCHITECTURE_CHANGES.md for details): |
| 218 | + - **Schema parsing** (`xmlSchemaParse`): NOT thread-safe, serialized via cache |
| 219 | + - **Validation** (`xmlSchemaValidateFile`): IS thread-safe, runs in parallel |
| 220 | + - Arc-wrapped schemas enable safe sharing across tasks |
| 221 | + - Each validation creates its own context (per-task isolation) |
| 222 | +4. **Error Handling**: libxml2 prints errors to stderr - this is expected in tests (e.g., "Schemas parser error" messages) |
| 223 | + |
| 224 | +Example safe pattern: |
| 225 | +```rust |
| 226 | +impl Drop for SchemaContext { |
| 227 | + fn drop(&mut self) { |
| 228 | + unsafe { |
| 229 | + if !self.schema.is_null() { |
| 230 | + xmlSchemaFree(self.schema); |
| 231 | + } |
| 232 | + } |
| 233 | + } |
| 234 | +} |
| 235 | +``` |
| 236 | + |
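The sharing model in point 3 (one Arc-wrapped schema, one context per validation, a single free when the last reference goes away) can be demonstrated without libxml2. In this sketch, `Schema`, `ValidationContext`, and `validate_all` are hypothetical stand-ins: `Drop` bumps a counter where the real wrapper would call `xmlSchemaFree`, and std threads take the place of tokio tasks.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Stand-in for a parsed schema; the real type would hold a libxml2 pointer.
struct Schema {
    frees: Arc<AtomicUsize>,
}

impl Drop for Schema {
    fn drop(&mut self) {
        // The real implementation would call `xmlSchemaFree` here.
        self.frees.fetch_add(1, Ordering::SeqCst);
    }
}

/// Per-task validation state, created fresh for every validation.
struct ValidationContext {
    schema: Arc<Schema>,
}

impl ValidationContext {
    fn validate(&self, xml: &str) -> bool {
        // Placeholder check; a real context would call into libxml2.
        let _ = &self.schema;
        !xml.is_empty()
    }
}

/// Validate documents in parallel and return how many passed.
fn validate_all(schema: Arc<Schema>, docs: Vec<String>) -> usize {
    let handles: Vec<_> = docs
        .into_iter()
        .map(|doc| {
            let schema = Arc::clone(&schema);
            thread::spawn(move || ValidationContext { schema }.validate(&doc))
        })
        .collect();
    handles
        .into_iter()
        .map(|handle| handle.join().unwrap())
        .filter(|passed| *passed)
        .count()
}
```

Because every thread holds its own `Arc` clone, the schema outlives all validations and its `Drop` runs exactly once, after the last clone is released.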
| 237 | +## Dependency Injection Pattern |
| 238 | + |
| 239 | +For testability, the config system uses trait-based dependency injection: |
| 240 | + |
| 241 | +```rust |
| 242 | +// Production: uses real environment variables |
| 243 | +ConfigManager::apply_environment_overrides(config) |
| 244 | + |
| 245 | +// Testing: uses mock provider (no unsafe code) |
| 246 | +let mut mock_env = MockEnvProvider::new(); |
| 247 | +mock_env.set("VALIDATE_XML_THREADS", "16"); |
| 248 | +ConfigManager::apply_environment_overrides_with(&mock_env, config) |
| 249 | +``` |
| 250 | + |
| 251 | +**Never** use `std::env::set_var` or `std::env::remove_var` in tests - always use `MockEnvProvider`. |
| 252 | + |
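For readers unfamiliar with the pattern, a provider trait of this kind might look like the sketch below. Only the names `EnvProvider`, `MockEnvProvider`, `new`, and `set` come from this document; the `get` method signature, `SystemEnvProvider`, and the `threads_override` helper are assumptions made for illustration and may not match `src/config.rs`.

```rust
use std::collections::HashMap;

/// Abstraction over environment lookups (the `get` signature is an assumption).
trait EnvProvider {
    fn get(&self, key: &str) -> Option<String>;
}

/// Hypothetical production provider backed by the process environment.
struct SystemEnvProvider;

impl EnvProvider for SystemEnvProvider {
    fn get(&self, key: &str) -> Option<String> {
        std::env::var(key).ok()
    }
}

/// In-memory provider for tests; it never touches global state.
#[derive(Default)]
struct MockEnvProvider {
    vars: HashMap<String, String>,
}

impl MockEnvProvider {
    fn new() -> Self {
        Self::default()
    }

    fn set(&mut self, key: &str, value: &str) {
        self.vars.insert(key.to_string(), value.to_string());
    }
}

impl EnvProvider for MockEnvProvider {
    fn get(&self, key: &str) -> Option<String> {
        self.vars.get(key).cloned()
    }
}

/// Hypothetical helper: read `VALIDATE_XML_THREADS`, ignoring invalid values.
fn threads_override(env: &dyn EnvProvider) -> Option<usize> {
    env.get("VALIDATE_XML_THREADS")?.parse().ok()
}
```

Code that takes `&dyn EnvProvider` can be exercised with any combination of variables in a test, and parallel tests cannot interfere with each other because nothing is written to the process environment.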
| 253 | +## Performance Considerations |
| 254 | + |
| 255 | +1. **Schema Caching**: First run downloads schemas (~30s for 20k files), subsequent runs use cache (~2s) |
| 256 | +2. **Concurrency**: Default = CPU cores, but can be limited for memory-constrained systems |
| 257 | +3. **Memory**: Bounded by L1 cache size (default 100 entries) and concurrent task count |
| 258 | +4. **Network**: HTTP client uses connection pooling and retry logic with exponential backoff |
| 259 | + |
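As a rough illustration of the exponential backoff mentioned in point 4, the delay before each retry can be computed as below. The function name `backoff_delay`, the doubling factor, and the cap are assumptions; the project's HTTP client may use different constants or add jitter.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): `base * 2^attempt`, capped at `max`.
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    // Saturating arithmetic keeps very large attempt counts from overflowing.
    let factor = 2u32.saturating_pow(attempt);
    base.saturating_mul(factor).min(max)
}
```

With a 100 ms base and a 5 s cap, the delays grow as 100 ms, 200 ms, 400 ms, 800 ms and so on until they reach the cap.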
| 260 | +## Common Gotchas |
| 261 | + |
| 262 | +1. **libxml2 Errors to stderr**: The message "Schemas parser error : The XML document 'in_memory_buffer' is not a schema document" is EXPECTED in test output - it's from tests validating error handling |
| 263 | + |
| 264 | +2. **Timing Tests**: Any test using `tokio::time::sleep()` is likely flaky - refactor to use proper synchronization |
| 265 | + |
| 266 | +3. **Environment Pollution**: Tests must not modify global environment state - use `MockEnvProvider` pattern |
| 267 | + |
| 268 | +4. **Ignored Tests**: Running the full test suite may show "24 ignored". This is correct: these are the network tests |
| 269 | + |
| 270 | +## Code Generation and AI Assistance |
| 271 | + |
| 272 | +This project was collaboratively developed with Claude Code. When making changes: |
| 273 | + |
| 274 | +1. Maintain the existing architecture patterns (async-first, dependency injection, trait-based abstractions) |
| 275 | +2. Add tests for all new functionality (aim for 100% coverage) |
| 276 | +3. Update documentation strings for public APIs |
| 277 | +4. Run the full test suite and the linter before committing: `cargo test && cargo clippy` |
| 278 | +5. For network-dependent code, mark tests with `#[ignore]` and document why |