Commit 9826b62

A lot of improvements. Thanks Claude.
1 parent 3e55448 commit 9826b62

62 files changed, +16767 -753 lines changed

.github/dependabot.yml

Lines changed: 0 additions & 36 deletions
This file was deleted.

.github/workflows/ci.yml

Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
---
name: CI

on:
  push:
    branches: [main, master]
  pull_request:
    branches: [main, master]

env:
  CARGO_TERM_COLOR: always

jobs:
  test:
    name: Build and Test
    runs-on: ${{ matrix.os }}
    strategy:
      fail-fast: false
      matrix:
        os: [ubuntu-latest, macos-latest, windows-latest]
        include:
          - os: ubuntu-latest
            libxml2_install: sudo apt-get update && sudo apt-get install -y libxml2-dev
          - os: macos-latest
            libxml2_install: brew install libxml2
          - os: windows-latest
            libxml2_install: |
              choco install libxml2
              echo "LIBXML2_LIB_DIR=C:\tools\libxml2\lib" >> $GITHUB_ENV
              echo "LIBXML2_INCLUDE_DIR=C:\tools\libxml2\include" >> $GITHUB_ENV

    steps:
      - name: Checkout code
        uses: actions/checkout@v5

      - name: Install Rust toolchain
        uses: dtolnay/rust-toolchain@stable
        with:
          components: rustfmt, clippy

      - name: Cache Rust dependencies and build artifacts
        uses: Swatinem/rust-cache@v2

      - name: Install libxml2
        run: ${{ matrix.libxml2_install }}

      - name: Build
        run: cargo build --release

      - name: Run tests
        run: cargo test

      - name: Run clippy
        run: cargo clippy -- -D warnings

      - name: Check formatting
        run: cargo fmt --check

.travis.yml

Lines changed: 0 additions & 12 deletions
This file was deleted.

CLAUDE.md

Lines changed: 278 additions & 0 deletions
@@ -0,0 +1,278 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

validate-xml is a high-performance XML schema validator written in Rust. It validates thousands of XML files against XSD schemas using concurrent processing and intelligent two-tier caching (memory + disk). Built with libxml2 FFI bindings and async I/O throughout.

**Key Performance**: Validates 20,000 files in ~2 seconds (cached) or ~30 seconds (first run with schema downloads).

## Common Commands

### Building and Testing

```bash
# Development build
cargo build

# Release build (optimized)
cargo build --release

# Run all tests (deterministic, no network calls)
cargo test

# Run a specific test
cargo test test_name

# Run tests with output visible
cargo test -- --nocapture

# Run only library tests (fastest)
cargo test --lib

# Run ignored network tests (requires internet)
cargo test -- --ignored

# Run a single test file
cargo test --test http_client_test
```

### Running the Binary

```bash
# Run with development build
cargo run -- /path/to/xml/files

# Run with release build (much faster)
cargo run --release -- /path/to/xml/files

# With options
cargo run --release -- --verbose --extensions xml,cmdi /path/to/files

# With debug logging
RUST_LOG=debug cargo run -- /path/to/files
```

### Code Quality

```bash
# Format code
cargo fmt

# Check formatting without changes
cargo fmt --check

# Run clippy linter
cargo clippy

# Fix clippy warnings automatically
cargo clippy --fix
```

## Architecture

### Core Components

The codebase follows a modular async-first architecture with clear separation of concerns:

1. **File Discovery** (`file_discovery.rs`)
   - Recursively traverses directories to find XML files
   - Filters by extension using glob patterns
   - Single-threaded sequential operation

2. **Schema Loading** (`schema_loader.rs`)
   - Extracts schema URLs from XML using regex (xsi:schemaLocation, xsi:noNamespaceSchemaLocation)
   - Downloads remote schemas via async HTTP client
   - Validates schema content before caching
   - Integrates with two-tier cache system

3. **Two-Tier Caching** (`cache.rs`)
   - **L1 (Memory)**: moka cache for in-run reuse (microsecond lookups)
   - **L2 (Disk)**: cacache for cross-run persistence (millisecond lookups)
   - Thread-safe via Arc wrapping
   - Configurable TTL and size limits

4. **Validation Engine** (`validator.rs`)
   - **Hybrid architecture**: Async I/O orchestration + sync CPU-bound validation
   - Spawns concurrent async tasks (bounded by semaphore)
   - Each task: load XML → fetch schema → validate via libxml2 (synchronous, thread-safe)
   - Collects results and statistics
   - Default concurrency = CPU core count

5. **libxml2 FFI** (`libxml2.rs`)
   - Safe Rust wrappers around unsafe C FFI calls
   - Memory management via RAII patterns
   - Schema parsing and XML validation
   - **CRITICAL Thread Safety**:
     - Schema parsing is NOT thread-safe (serialized via cache)
     - Validation IS thread-safe (parallel execution, no global locks)

6. **Error Handling** (`error.rs`, `error_reporter.rs`)
   - Structured error types using thiserror
   - Context-rich error messages with recovery hints
   - Line/column precision for validation errors
   - Both human-readable and JSON output formats

7. **Configuration** (`config.rs`)
   - Environment variable support via `EnvProvider` trait pattern
   - File-based config (TOML/JSON)
   - CLI argument merging (CLI > env > file > defaults)
   - **IMPORTANT**: Uses dependency injection for testability

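The two-tier lookup described for `cache.rs` (memory first, then disk, then the network) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's implementation: `TwoTierCache`, `Source`, and `get_or_fetch` are hypothetical names, a `HashMap` stands in for moka, a plain directory stands in for cacache, and TTL and size limits are left out.

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::PathBuf;
use std::sync::Mutex;

/// Hypothetical two-tier cache: an in-memory map (tier 1) in front of a
/// directory of files (tier 2). Stand-ins for moka and cacache.
pub struct TwoTierCache {
    memory: Mutex<HashMap<String, Vec<u8>>>,
    disk_dir: PathBuf,
}

/// Which tier satisfied a lookup.
#[derive(Debug, PartialEq)]
pub enum Source {
    Memory,
    Disk,
    Origin,
}

impl TwoTierCache {
    pub fn new(disk_dir: PathBuf) -> io::Result<Self> {
        fs::create_dir_all(&disk_dir)?;
        Ok(Self { memory: Mutex::new(HashMap::new()), disk_dir })
    }

    /// Map a key (for example a schema URL) to a file name on disk.
    /// Simplification: distinct keys can collide; a real cache would hash.
    fn disk_path(&self, key: &str) -> PathBuf {
        let name: String = key
            .chars()
            .map(|c| if c.is_ascii_alphanumeric() { c } else { '_' })
            .collect();
        self.disk_dir.join(name)
    }

    /// Look in memory, then on disk, and only then call `fetch`.
    /// Whatever is found is promoted into the faster tiers.
    pub fn get_or_fetch<F>(&self, key: &str, fetch: F) -> io::Result<(Vec<u8>, Source)>
    where
        F: FnOnce() -> io::Result<Vec<u8>>,
    {
        if let Some(hit) = self.memory.lock().unwrap().get(key) {
            return Ok((hit.clone(), Source::Memory));
        }
        let path = self.disk_path(key);
        let (bytes, source) = match fs::read(&path) {
            Ok(bytes) => (bytes, Source::Disk),
            Err(_) => {
                let bytes = fetch()?;
                fs::write(&path, &bytes)?;
                (bytes, Source::Origin)
            }
        };
        self.memory.lock().unwrap().insert(key.to_string(), bytes.clone());
        Ok((bytes, source))
    }
}
```

A first lookup for a key reports `Source::Origin`, a repeat in the same process reports `Source::Memory`, and a fresh `TwoTierCache` pointed at the same directory reports `Source::Disk`, which mirrors the in-run versus cross-run split described above.
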
### Data Flow

```
CLI Args → Config Merge → File Discovery → Schema Extraction
                                                ↓
                                    Schema Cache Check
                                    (L1 → L2 → HTTP)
                                                ↓
                                    Concurrent Validation Tasks
                                    (bounded by semaphore)
                                                ↓
                                    Error Aggregation → Output
                                    (Text or JSON format)
```

### Key Design Patterns

1. **Async-First**: All I/O operations use tokio async runtime
2. **Dependency Injection**: Config system uses `EnvProvider` trait for testability
3. **Two-Tier Caching**: Memory (fast) + Disk (persistent) for optimal performance
4. **Bounded Concurrency**: Semaphore limits prevent resource exhaustion
5. **RAII for FFI**: Proper cleanup of libxml2 resources via Drop trait

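The bounded-concurrency pattern can be illustrated without tokio. The sketch below is an assumption-laden stand-in: it uses OS threads and a hand-rolled counting semaphore (`Semaphore` and `run_bounded` are hypothetical names), whereas the project is described as using async tasks. The idea is the same: every job must hold a permit while it runs, so at most `limit` jobs are in flight.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Minimal counting semaphore built from a mutex and a condition variable.
pub struct Semaphore {
    permits: Mutex<usize>,
    available: Condvar,
}

impl Semaphore {
    pub fn new(permits: usize) -> Self {
        Self { permits: Mutex::new(permits), available: Condvar::new() }
    }

    /// Block until a permit is free, then take it.
    pub fn acquire(&self) {
        let mut permits = self.permits.lock().unwrap();
        while *permits == 0 {
            permits = self.available.wait(permits).unwrap();
        }
        *permits -= 1;
    }

    /// Return a permit and wake one waiter.
    pub fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.available.notify_one();
    }
}

/// Run `job` once per item with at most `limit` jobs in flight.
/// Returns the results in input order plus the peak concurrency observed.
pub fn run_bounded<T, R, F>(items: Vec<T>, limit: usize, job: F) -> (Vec<R>, usize)
where
    T: Send + 'static,
    R: Send + 'static,
    F: Fn(T) -> R + Send + Sync + 'static,
{
    assert!(limit > 0, "a limit of zero would block forever");
    let semaphore = Arc::new(Semaphore::new(limit));
    let in_flight = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));
    let job = Arc::new(job);

    let handles: Vec<_> = items
        .into_iter()
        .map(|item| {
            let semaphore = Arc::clone(&semaphore);
            let in_flight = Arc::clone(&in_flight);
            let peak = Arc::clone(&peak);
            let job = Arc::clone(&job);
            thread::spawn(move || {
                semaphore.acquire();
                let now = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                let result = (*job)(item);
                in_flight.fetch_sub(1, Ordering::SeqCst);
                semaphore.release();
                result
            })
        })
        .collect();

    let results = handles.into_iter().map(|h| h.join().unwrap()).collect();
    (results, peak.load(Ordering::SeqCst))
}
```

In this simplified version a panicking job never returns its permit; a fuller version would release it from a guard's `Drop` implementation.
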
## Testing Philosophy

### Test Structure

The project has **214+ passing tests** organized as:
- **115 unit tests** in `src/` modules (fast, no I/O)
- **99 integration tests** in `tests/` (slower, includes I/O simulation)
- **24 ignored tests** (network-dependent, run explicitly with `--ignored`)

### Critical Testing Rules

1. **No Unsafe Code in Tests**: All environment variable manipulation must use `MockEnvProvider` pattern (see `src/config.rs` tests)

2. **No Real Network Calls**: Tests making HTTP requests to external services (httpbin.org) must be marked `#[ignore]`
   ```rust
   #[tokio::test]
   #[ignore] // Requires internet connectivity - run with: cargo test -- --ignored
   async fn test_network_operation() { ... }
   ```

3. **Deterministic Tests Only**: Never use:
   - `tokio::time::sleep()` without proper synchronization
   - `tokio::spawn()` without waiting for completion
   - Real system time for timing assertions

4. **Race Condition Prevention**: When testing concurrent code, use proper synchronization:
   ```rust
   // BAD: Race condition
   tokio::spawn(async move { /* ... */ });
   tokio::time::sleep(Duration::from_millis(50)).await; // Hope it finishes

   // GOOD: Proper synchronization
   let handle = tokio::spawn(async move { /* ... */ });
   handle.await.unwrap(); // Wait for completion
   ```

### Running Flaky/Network Tests

Network tests are ignored by default to ensure CI reliability:
```bash
# Run only network tests
cargo test -- --ignored

# Run all tests including network tests
cargo test -- --include-ignored
```

## Environment Variables

The config system supports environment variable overrides:

```bash
# Cache configuration
export VALIDATE_XML_CACHE_DIR=/custom/cache
export VALIDATE_XML_CACHE_TTL=48

# Validation settings
export VALIDATE_XML_THREADS=4
export VALIDATE_XML_TIMEOUT=120

# Output settings
export VALIDATE_XML_VERBOSE=true
export VALIDATE_XML_FORMAT=json
```

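The override order stated for `config.rs` (CLI > env > file > defaults) can be modelled as layers of optional values, where a higher layer wins only for the fields it actually sets. The sketch below is hypothetical throughout: `PartialConfig`, `Config`, `resolve`, the three fields chosen, and the fallback values `1`, `30`, and `"text"` are illustrative assumptions, not the project's real types or defaults.

```rust
/// One layer of settings; `None` means "this layer does not set it".
#[derive(Debug, Clone, Default, PartialEq)]
pub struct PartialConfig {
    pub threads: Option<usize>,
    pub timeout: Option<u64>,
    pub format: Option<String>,
}

/// Fully resolved settings after merging every layer.
#[derive(Debug, PartialEq)]
pub struct Config {
    pub threads: usize,
    pub timeout: u64,
    pub format: String,
}

impl PartialConfig {
    /// Values set in `self` win; gaps are filled from `lower`.
    pub fn or(self, lower: PartialConfig) -> PartialConfig {
        PartialConfig {
            threads: self.threads.or(lower.threads),
            timeout: self.timeout.or(lower.timeout),
            format: self.format.or(lower.format),
        }
    }
}

/// Merge the layers in precedence order: CLI > env > file > defaults.
/// The default values here are placeholders for illustration only.
pub fn resolve(cli: PartialConfig, env: PartialConfig, file: PartialConfig) -> Config {
    let merged = cli.or(env).or(file);
    Config {
        threads: merged.threads.unwrap_or(1),
        timeout: merged.timeout.unwrap_or(30),
        format: merged.format.unwrap_or_else(|| "text".to_string()),
    }
}
```

With this shape, a value such as `VALIDATE_XML_THREADS=4` would populate the env layer and take effect only when the CLI layer leaves `threads` unset.
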
## libxml2 FFI Critical Notes

When working with `libxml2.rs`:

1. **Memory Safety**: All pointers must be checked for null before dereferencing
2. **Cleanup**: Schema contexts must be freed via `xmlSchemaFree` in Drop implementations
3. **Thread Safety** (see ARCHITECTURE_CHANGES.md for details):
   - **Schema parsing** (`xmlSchemaParse`): NOT thread-safe, serialized via cache
   - **Validation** (`xmlSchemaValidateFile`): IS thread-safe, runs in parallel
   - Arc-wrapped schemas enable safe sharing across tasks
   - Each validation creates its own context (per-task isolation)
4. **Error Handling**: libxml2 prints errors to stderr - this is expected in tests (e.g., "Schemas parser error" messages)

Example safe pattern:
```rust
impl Drop for SchemaContext {
    fn drop(&mut self) {
        unsafe {
            if !self.schema.is_null() {
                xmlSchemaFree(self.schema);
            }
        }
    }
}
```

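The thread-safety split above (parsing serialized, validation parallel, schemas shared via `Arc`) can be sketched without any FFI. Everything here is a stand-in: `ParsedSchema`, `SchemaRegistry`, and `validate_all` are hypothetical names, the "parse" step just builds a struct, and the "validation" is a placeholder check. The point is the locking shape: parsing happens only while a mutex is held, and validation needs nothing but a shared reference.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

/// Stand-in for a parsed schema. Real code would own a libxml2 pointer here.
pub struct ParsedSchema {
    pub url: String,
}

impl ParsedSchema {
    /// Placeholder check that only needs `&self`, so many threads can call it.
    pub fn validate(&self, document: &str) -> bool {
        !document.trim().is_empty()
    }
}

/// Parses each schema at most once; the mutex serializes all parsing.
pub struct SchemaRegistry {
    schemas: Mutex<HashMap<String, Arc<ParsedSchema>>>,
    parse_count: AtomicUsize,
}

impl SchemaRegistry {
    pub fn new() -> Self {
        Self { schemas: Mutex::new(HashMap::new()), parse_count: AtomicUsize::new(0) }
    }

    pub fn get_or_parse(&self, url: &str) -> Arc<ParsedSchema> {
        let mut schemas = self.schemas.lock().unwrap();
        if let Some(existing) = schemas.get(url) {
            return Arc::clone(existing);
        }
        // Only one thread at a time reaches this point: the lock is held.
        self.parse_count.fetch_add(1, Ordering::SeqCst);
        let parsed = Arc::new(ParsedSchema { url: url.to_string() });
        schemas.insert(url.to_string(), Arc::clone(&parsed));
        parsed
    }

    pub fn parse_count(&self) -> usize {
        self.parse_count.load(Ordering::SeqCst)
    }
}

/// Validate every document on its own thread against one shared schema.
pub fn validate_all(registry: &Arc<SchemaRegistry>, url: &str, documents: Vec<String>) -> Vec<bool> {
    let handles: Vec<_> = documents
        .into_iter()
        .map(|document| {
            let registry = Arc::clone(registry);
            let url = url.to_string();
            thread::spawn(move || registry.get_or_parse(&url).validate(&document))
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

However many threads ask for the same URL, the parse counter ends at one, because later callers find the entry already in the map.
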
## Dependency Injection Pattern

For testability, the config system uses trait-based dependency injection:

```rust
// Production: uses real environment variables
ConfigManager::apply_environment_overrides(config)

// Testing: uses mock provider (no unsafe code)
let mut mock_env = MockEnvProvider::new();
mock_env.set("VALIDATE_XML_THREADS", "16");
ConfigManager::apply_environment_overrides_with(&mock_env, config)
```

**Never** use `std::env::set_var` or `std::env::remove_var` in tests - always use `MockEnvProvider`.

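One way such a trait could look is sketched below. Only the names `EnvProvider`, `MockEnvProvider`, `new`, and `set` come from the snippet above; the trait's `get` method, `SystemEnvProvider`, and the `threads_from_env` helper are assumptions made for illustration and may differ from what `src/config.rs` actually defines.

```rust
use std::collections::HashMap;

/// Abstraction over "where environment variables come from".
/// The `get` signature is an assumption, not the project's confirmed API.
pub trait EnvProvider {
    fn get(&self, key: &str) -> Option<String>;
}

/// Production provider: reads the real process environment.
pub struct SystemEnvProvider;

impl EnvProvider for SystemEnvProvider {
    fn get(&self, key: &str) -> Option<String> {
        std::env::var(key).ok()
    }
}

/// Test provider: an in-memory map, so no global state is touched.
#[derive(Default)]
pub struct MockEnvProvider {
    vars: HashMap<String, String>,
}

impl MockEnvProvider {
    pub fn new() -> Self {
        Self::default()
    }

    pub fn set(&mut self, key: &str, value: &str) {
        self.vars.insert(key.to_string(), value.to_string());
    }
}

impl EnvProvider for MockEnvProvider {
    fn get(&self, key: &str) -> Option<String> {
        self.vars.get(key).cloned()
    }
}

/// Hypothetical helper: read the thread count through any provider,
/// falling back when the variable is missing or not a number.
pub fn threads_from_env(env: &dyn EnvProvider, fallback: usize) -> usize {
    env.get("VALIDATE_XML_THREADS")
        .and_then(|value| value.parse::<usize>().ok())
        .unwrap_or(fallback)
}
```

Because the helper only sees the trait, tests can exercise missing, valid, and malformed values without touching the process environment.
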
## Performance Considerations

1. **Schema Caching**: First run downloads schemas (~30s for 20k files), subsequent runs use cache (~2s)
2. **Concurrency**: Default = CPU cores, but can be limited for memory-constrained systems
3. **Memory**: Bounded by L1 cache size (default 100 entries) and concurrent task count
4. **Network**: HTTP client uses connection pooling and retry logic with exponential backoff

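Retry with exponential backoff, as mentioned for the HTTP client, can be sketched as follows. This is a generic illustration and not the client's actual code: `backoff_delay` and `retry_with_backoff` are hypothetical names, the doubling schedule and cap are assumptions, and the sleep function is injected so that tests stay deterministic instead of waiting on real time.

```rust
use std::time::Duration;

/// Delay before retrying after failed attempt `attempt` (0-based):
/// `base * 2^attempt`, capped at `max`.
pub fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).unwrap_or(max).min(max)
}

/// Call `op` until it succeeds or `max_attempts` calls have failed.
/// `sleep` is injected: production code could pass `std::thread::sleep`,
/// while tests can record the delays without waiting.
pub fn retry_with_backoff<T, E, Op, Sleep>(
    max_attempts: u32,
    base: Duration,
    max: Duration,
    mut op: Op,
    mut sleep: Sleep,
) -> Result<T, E>
where
    Op: FnMut(u32) -> Result<T, E>,
    Sleep: FnMut(Duration),
{
    let mut attempt = 0;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(error) if attempt + 1 >= max_attempts => return Err(error),
            Err(_) => {
                sleep(backoff_delay(attempt, base, max));
                attempt += 1;
            }
        }
    }
}
```

Injecting `sleep` keeps this consistent with the testing rules in this file, which discourage real timing in tests.
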
## Common Gotchas

1. **libxml2 Errors to stderr**: The message "Schemas parser error : The XML document 'in_memory_buffer' is not a schema document" is EXPECTED in test output - it's from tests validating error handling

2. **Timing Tests**: Any test using `tokio::time::sleep()` is likely flaky - refactor to use proper synchronization

3. **Environment Pollution**: Tests must not modify global environment state - use `MockEnvProvider` pattern

4. **Ignored Tests**: Running full test suite may show "24 ignored" - this is correct (network tests)

## Code Generation and AI Assistance

This project was collaboratively developed with Claude Code. When making changes:

1. Maintain the existing architecture patterns (async-first, dependency injection, trait-based abstractions)
2. Add tests for all new functionality (aim for 100% coverage)
3. Update documentation strings for public APIs
4. Run full test suite before committing: `cargo test && cargo clippy`
5. For network-dependent code, mark tests with `#[ignore]` and document why
