| 1 | +# Dataframely - Coding Agent Instructions |
| 2 | + |
| 3 | +## Project Overview |
| 4 | + |
| 5 | +Dataframely is a declarative, polars-native data frame validation library. It validates schemas and data content in |
| 6 | +polars DataFrames using native polars expressions and a custom Rust-based polars plugin for high performance. It |
| 7 | +supports validating individual data frames via `Schema` classes and interconnected data frames via `Collection` classes. |
| 8 | + |
| 9 | +## Tech Stack |
| 10 | + |
| 11 | +### Core Technologies |
| 12 | + |
| 13 | +- **Python**: Primary language for the public API |
| 14 | +- **Rust**: Backend for polars plugin and custom regex operations |
| 15 | +- **Polars**: Only supported data frame library |
| 16 | +- **pyo3 & maturin**: Rust-Python bindings and build system |
| 17 | +- **pixi**: Primary environment and task manager (NOT pip/conda directly) |
| 18 | + |
| 19 | +### Build System |
| 20 | + |
| 21 | +- **maturin**: Builds the Rust extension module `dataframely._native` |
| 22 | +- **Cargo**: Rust dependency management |
| 23 | +- Rust toolchain specified in `rust-toolchain.toml` with clippy and rustfmt components |
| 24 | + |
| 25 | +## Environment Setup |
| 26 | + |
| 27 | +**CRITICAL**: Always use `pixi` commands - never run `pip`, `conda`, `python`, or `cargo` directly unless specifically |
| 28 | +required for Rust-only operations. |
| 29 | + |
| 30 | +### Initial Setup |
| 31 | + |
| 32 | +Unless already performed via external setup steps: |
| 33 | + |
| 34 | +```bash |
| 35 | +# Install Rust toolchain (as pinned in rust-toolchain.toml) |
| 36 | +rustup show |
| 37 | + |
| 38 | +# Install pixi environment and dependencies |
| 39 | +pixi install |
| 40 | + |
| 41 | +# Build and install the package locally (REQUIRED after Rust changes) |
| 42 | +pixi run postinstall |
| 43 | +``` |
| 44 | + |
| 45 | +### After Rust Code Changes |
| 46 | + |
| 47 | +**Always run** `pixi run postinstall` after modifying any Rust code in `src/` to rebuild the native extension. |
| 48 | + |
| 49 | +## Development Workflow |
| 50 | + |
| 51 | +### Running Tests |
| 52 | + |
| 53 | +```bash |
| 54 | +# Run all tests (excludes S3 tests by default) |
| 55 | +pixi run test |
| 56 | + |
| 57 | +# Run tests with S3 backend (requires moto server) |
| 58 | +pixi run test -m s3 |
| 59 | + |
| 60 | +# Run specific test file or directory |
| 61 | +pixi run test tests/schema/ |
| 62 | + |
| 63 | +# Run with coverage |
| 64 | +pixi run test-coverage |
| 65 | + |
| 66 | +# Run benchmarks |
| 67 | +pixi run test-bench |
| 68 | +``` |
| 69 | + |
| 70 | +### Code Quality |
| 71 | + |
| 72 | +**NEVER** run linters/formatters directly. Use pre-commit: |
| 73 | + |
| 74 | +```bash |
| 75 | +# Run all pre-commit hooks |
| 76 | +pixi run pre-commit run |
| 77 | +``` |
| 78 | + |
| 79 | +Pre-commit handles: |
| 80 | + |
| 81 | +- **Python**: ruff (lint & format), mypy (type checking), docformatter |
| 82 | +- **Rust**: cargo fmt, cargo clippy |
| 83 | +- **Other**: prettier (md/yml), taplo (toml), license headers, trailing whitespace |
| 84 | + |
| 85 | +### Building Documentation |
| 86 | + |
| 87 | +```bash |
| 88 | +# Build documentation |
| 89 | +pixi run -e docs postinstall |
| 90 | +pixi run docs |
| 91 | + |
| 92 | +# Open in browser (macOS) |
| 93 | +open docs/_build/html/index.html |
| 94 | +``` |
| 95 | + |
| 96 | +## Project Structure |
| 97 | + |
| 98 | +``` |
| 99 | +dataframely/ # Python package |
| 100 | + schema.py # Core Schema class for DataFrame validation |
| 101 | + collection/ # Collection class for validating multiple interconnected DataFrames |
| 102 | + columns/ # Column type definitions (String, Integer, Float, etc.) |
| 103 | + testing/ # Testing utilities (factories, masks, storage mocks) |
| 104 | + _storage/ # Storage backends (Parquet, Delta Lake) |
| 105 | + _rule.py # Rule decorator for validation rules |
| 106 | + _plugin.py # Polars plugin registration |
| 107 | + _native.pyi # Type stubs for Rust extension |
| 108 | + |
| 109 | +src/ # Rust source code |
| 110 | + lib.rs # PyO3 module definition |
| 111 | + polars_plugin/ # Custom polars plugin for validation |
| 112 | + regex/ # Custom regex operations |
| 113 | + |
| 114 | +tests/ # Unit tests (mirrors dataframely/ structure) |
| 115 | + benches/ # Benchmark tests |
| 116 | + conftest.py # Shared pytest fixtures (including s3_server) |
| 117 | + |
| 118 | +docs/ # Sphinx documentation |
| 119 | + guides/ # User guides and examples |
| 120 | + api/ # Auto-generated API reference |
| 121 | +``` |
| 122 | + |
| 123 | +## Pixi Environments |
| 124 | + |
| 125 | +Multiple environments for different purposes: |
| 126 | + |
| 127 | +- **default**: Base Python + core dependencies |
| 128 | +- **dev**: Includes jupyter for notebooks |
| 129 | +- **test**: Testing dependencies (pytest, moto, boto3, etc.) |
| 130 | +- **docs**: Documentation building (sphinx, myst-parser, etc.) |
| 131 | +- **lint**: Linting and formatting tools |
| 132 | +- **optionals**: Optional dependencies (pydantic, deltalake, pyarrow, sqlalchemy) |
| 133 | +- **py310-py314**: Python version-specific environments |
| 134 | + |
| 135 | +Use `-e <env>` to run commands in specific environments: |
| 136 | + |
| 137 | +```bash |
| 138 | +pixi run -e test test |
| 139 | +pixi run -e docs docs |
| 140 | +``` |
| 141 | + |
| 142 | +## API Design Principles |
| 143 | + |
| 144 | +### Critical Guidelines |
| 145 | + |
| 146 | +1. **NO BREAKING CHANGES**: Public API must remain backward compatible |
| 147 | +2. **100% Test Coverage**: All new code requires tests |
| 148 | +3. **Documentation Required**: All public features need docstrings + API docs |
| 149 | +4. **Cautious API Extension**: Avoid adding to public API unless necessary |
| 150 | + |
| 151 | +### Public API |
| 152 | + |
| 153 | +Public exports are in `dataframely/__init__.py`. Main components: |
| 154 | + |
| 155 | +- **Schema classes**: `Schema` for DataFrame validation |
| 156 | +- **Collection classes**: `Collection`, `CollectionMember` for multi-DataFrame validation |
| 157 | +- **Column types**: `String`, `Integer`, `Float`, `Bool`, `Date`, `Datetime`, etc. |
| 158 | +- **Decorators**: `@rule()`, `@filter()` |
| 159 | +- **Type hints**: `DataFrame[Schema]`, `LazyFrame[Schema]`, `Validation` |
| 160 | + |
| 161 | +## Common Pitfalls & Solutions |
| 162 | + |
| 163 | +### S3 Testing |
| 164 | + |
| 165 | +The `s3_server` fixture in `tests/conftest.py` uses `subprocess.Popen` to start `moto_server` on port 9999. This is a **workaround** for a polars issue with `ThreadedMotoServer`. Once the polars issue is fixed, replace the workaround with `ThreadedMotoServer` (the replacement code is commented out in the file). |
| 166 | + |
| 167 | +**Note**: CI skips S3 tests by default. Run with `pixi run test -m s3` when modifying storage backends. |
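The subprocess-based workaround described above can be sketched roughly as follows. This is not the code in `tests/conftest.py`. It is a minimal, standard-library-only illustration of the pattern, and the helper names `wait_for_port` and `run_server` are hypothetical.

```python
import contextlib
import socket
import subprocess
import time
from collections.abc import Iterator


def wait_for_port(host: str, port: int, timeout: float = 10.0) -> None:
    """Block until a TCP connection to (host, port) succeeds, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=0.5):
                return
        except OSError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"{host}:{port} did not become reachable")
            time.sleep(0.1)


@contextlib.contextmanager
def run_server(command: list[str], host: str, port: int) -> Iterator[str]:
    """Start `command` as a child process and yield the server's base URL.

    Hypothetical helper: the process is always terminated on exit, even if
    the server never becomes reachable.
    """
    process = subprocess.Popen(command)
    try:
        wait_for_port(host, port)
        yield f"http://{host}:{port}"
    finally:
        process.terminate()
        process.wait()
```

A session-scoped pytest fixture could wrap this, for example `with run_server(["moto_server", "-p", "9999"], "127.0.0.1", 9999) as url: yield url`. The exact `moto_server` arguments here are an assumption and may differ from what the real fixture passes.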
| 168 | + |
| 169 | +## Testing Strategy |
| 170 | + |
| 171 | +- Tests are organized by module, mirroring the `dataframely/` structure |
| 172 | +- Use `dy.Schema.sample()` for generating test data |
| 173 | +- Test both eager (`DataFrame`) and lazy (`LazyFrame`) execution |
| 174 | +- S3 tests use moto server fixture from `conftest.py` |
| 175 | +- Benchmark tests in `tests/benches/` use pytest-benchmark |
| 176 | + |
| 177 | +## Validation Pattern |
| 178 | + |
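| 179 | +Typical usage pattern (assuming `import dataframely as dy` and `import polars as pl`, with `df` an existing polars DataFrame): |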
| 180 | + |
| 181 | +```python |
| 182 | +class MySchema(dy.Schema): |
| 183 | + col = dy.String(nullable=False) |
| 184 | + |
| 185 | + @dy.rule() |
| 186 | + def my_rule(cls) -> pl.Expr: |
| 187 | + return pl.col("col").str.len_chars() > 0 |
| 188 | + |
| 189 | +# Validate and cast |
| 190 | +validated_df: dy.DataFrame[MySchema] = MySchema.validate(df, cast=True) |
| 191 | +``` |
| 192 | + |
| 193 | +## Key Configuration Files |
| 194 | + |
| 195 | +- `pixi.toml`: Environment and task definitions |
| 196 | +- `pyproject.toml`: Python package metadata, tool configurations (ruff, mypy, pytest) |
| 197 | +- `Cargo.toml`: Rust dependencies and build settings |
| 198 | +- `.pre-commit-config.yaml`: All code quality checks |
| 199 | +- `rust-toolchain.toml`: Pinned Rust nightly toolchain, including the clippy and rustfmt components |
| 200 | + |
| 201 | +## When Making Changes |
| 202 | + |
| 203 | +1. **Python code**: Run `pixi run pre-commit run` before committing |
| 204 | +2. **Rust code**: Run `pixi run postinstall` to rebuild, then run tests |
| 205 | +3. **Tests**: Ensure `pixi run test` passes |
| 206 | +4. **Documentation**: Update docstrings and API docs for any public feature you add or change |
| 207 | +5. **API changes**: Ensure backward compatibility or document migration path |
| 208 | + |
| 209 | +## Performance Considerations |
| 210 | + |
| 211 | +- Validation uses native polars expressions for performance |
| 212 | +- Custom Rust plugin for advanced validation logic |
| 213 | +- Lazy evaluation supported via `LazyFrame` for large datasets |
| 214 | +- Avoid materializing data unnecessarily in validation rules |