Commit e102cde

chore: Add initial instructions for copilot (#216)
1 parent fbad38a commit e102cde

2 files changed: +240 -0 lines changed
.github/copilot-instructions.md

Lines changed: 214 additions & 0 deletions
@@ -0,0 +1,214 @@
# Dataframely - Coding Agent Instructions

## Project Overview

Dataframely is a declarative, polars-native data frame validation library. It validates schemas and data content in
polars DataFrames using native polars expressions and a custom Rust-based polars plugin for high performance. It
supports validating individual data frames via `Schema` classes and interconnected data frames via `Collection` classes.

## Tech Stack

### Core Technologies

- **Python**: Primary language for the public API
- **Rust**: Backend for polars plugin and custom regex operations
- **Polars**: Only supported data frame library
- **pyo3 & maturin**: Rust-Python bindings and build system
- **pixi**: Primary environment and task manager (NOT pip/conda directly)

### Build System

- **maturin**: Builds the Rust extension module `dataframely._native`
- **Cargo**: Rust dependency management
- Rust toolchain specified in `rust-toolchain.toml` with clippy and rustfmt components

## Environment Setup

**CRITICAL**: Always use `pixi` commands - never run `pip`, `conda`, `python`, or `cargo` directly unless specifically
required for Rust-only operations.

### Initial Setup

Unless already performed via external setup steps:

```bash
# Install Rust toolchain
rustup show

# Install pixi environment and dependencies
pixi install

# Build and install the package locally (REQUIRED after Rust changes)
pixi run postinstall
```

### After Rust Code Changes

**Always run** `pixi run postinstall` after modifying any Rust code in `src/` to rebuild the native extension.

## Development Workflow

### Running Tests

```bash
# Run all tests (excludes S3 tests by default)
pixi run test

# Run tests with S3 backend (requires moto server)
pixi run test -m s3

# Run specific test file or directory
pixi run test tests/schema/

# Run with coverage
pixi run test-coverage

# Run benchmarks
pixi run test-bench
```

### Code Quality

**NEVER** run linters/formatters directly. Use pre-commit:

```bash
# Run all pre-commit hooks
pixi run pre-commit run
```

Pre-commit handles:

- **Python**: ruff (lint & format), mypy (type checking), docformatter
- **Rust**: cargo fmt, cargo clippy
- **Other**: prettier (md/yml), taplo (toml), license headers, trailing whitespace

### Building Documentation

```bash
# Build documentation
pixi run -e docs postinstall
pixi run docs

# Open in browser (macOS)
open docs/_build/html/index.html
```

## Project Structure

```
dataframely/           # Python package
  schema.py            # Core Schema class for DataFrame validation
  collection/          # Collection class for validating multiple interconnected DataFrames
  columns/             # Column type definitions (String, Integer, Float, etc.)
  testing/             # Testing utilities (factories, masks, storage mocks)
  _storage/            # Storage backends (Parquet, Delta Lake)
  _rule.py             # Rule decorator for validation rules
  _plugin.py           # Polars plugin registration
  _native.pyi          # Type stubs for Rust extension

src/                   # Rust source code
  lib.rs               # PyO3 module definition
  polars_plugin/       # Custom polars plugin for validation
  regex/               # Custom regex operations

tests/                 # Unit tests (mirrors dataframely/ structure)
  benches/             # Benchmark tests
  conftest.py          # Shared pytest fixtures (including s3_server)

docs/                  # Sphinx documentation
  guides/              # User guides and examples
  api/                 # Auto-generated API reference
```

## Pixi Environments

Multiple environments for different purposes:

- **default**: Base Python + core dependencies
- **dev**: Includes jupyter for notebooks
- **test**: Testing dependencies (pytest, moto, boto3, etc.)
- **docs**: Documentation building (sphinx, myst-parser, etc.)
- **lint**: Linting and formatting tools
- **optionals**: Optional dependencies (pydantic, deltalake, pyarrow, sqlalchemy)
- **py310-py314**: Python version-specific environments

Use `-e <env>` to run commands in specific environments:

```bash
pixi run -e test test
pixi run -e docs docs
```

## API Design Principles

### Critical Guidelines

1. **NO BREAKING CHANGES**: Public API must remain backward compatible
2. **100% Test Coverage**: All new code requires tests
3. **Documentation Required**: All public features need docstrings + API docs
4. **Cautious API Extension**: Avoid adding to public API unless necessary
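
One common way to follow the backward-compatibility guideline when a keyword argument has to be renamed is to keep accepting the old name and emit a deprecation warning. The sketch below is illustrative only: `load_frame`, `path`, and `file_path` are hypothetical names and are not part of Dataframely's API.

```python
from __future__ import annotations

import warnings


def load_frame(path: str | None = None, *, file_path: str | None = None) -> str:
    """Hypothetical function whose `file_path` keyword was renamed to `path`."""
    if file_path is not None:
        if path is not None:
            raise TypeError("pass either `path` or the deprecated `file_path`, not both")
        # Old call sites keep working but are told how to migrate.
        warnings.warn(
            "`file_path` is deprecated; use `path` instead",
            DeprecationWarning,
            stacklevel=2,
        )
        path = file_path
    if path is None:
        raise TypeError("missing required argument: `path`")
    return path
```

Existing callers that pass `file_path=` continue to work, while new code uses `path`; the warning text doubles as the migration note.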

### Public API

Public exports are in `dataframely/__init__.py`. Main components:

- **Schema classes**: `Schema` for DataFrame validation
- **Collection classes**: `Collection`, `CollectionMember` for multi-DataFrame validation
- **Column types**: `String`, `Integer`, `Float`, `Bool`, `Date`, `Datetime`, etc.
- **Decorators**: `@rule()`, `@filter()`
- **Type hints**: `DataFrame[Schema]`, `LazyFrame[Schema]`, `Validation`

## Common Pitfalls & Solutions

### S3 Testing

The `s3_server` fixture in `tests/conftest.py` uses `subprocess.Popen` to start moto_server on port 9999. This is a **workaround** for a polars issue with ThreadedMotoServer. When the polars issue is fixed, it should be replaced with ThreadedMotoServer (code is commented in the file).

**Note**: CI skips S3 tests by default. Run with `pixi run test -m s3` when modifying storage backends.
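
The subprocess-based approach can be sketched roughly as follows. This is not the code from `tests/conftest.py`; `wait_for_port` and `running_server` are hypothetical helpers that only show the general pattern of starting a server with `subprocess.Popen`, waiting for its port, and stopping it afterwards.

```python
import socket
import subprocess
import time
from collections.abc import Iterator
from contextlib import contextmanager


def wait_for_port(host: str, port: int, timeout: float = 10.0) -> bool:
    """Poll until a TCP connection to (host, port) succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=0.5):
                return True
        except OSError:
            time.sleep(0.1)
    return False


@contextmanager
def running_server(command: list[str], port: int, host: str = "127.0.0.1") -> Iterator[str]:
    """Start `command` as a child process and yield its URL once the port accepts connections."""
    process = subprocess.Popen(command)
    try:
        if not wait_for_port(host, port):
            raise RuntimeError(f"server did not start listening on {host}:{port}")
        yield f"http://{host}:{port}"
    finally:
        # Always stop the child process, even if the test body raised.
        process.terminate()
        try:
            process.wait(timeout=5)
        except subprocess.TimeoutExpired:
            process.kill()
            process.wait()
```

A pytest fixture could wrap this as `with running_server(["moto_server", "-p", "9999"], 9999) as url: yield url`; the exact `moto_server` arguments are an assumption here and should be checked against the real fixture.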

## Testing Strategy

- Tests are organized by module, mirroring the `dataframely/` structure
- Use `dy.Schema.sample()` for generating test data
- Test both eager (`DataFrame`) and lazy (`LazyFrame`) execution
- S3 tests use moto server fixture from `conftest.py`
- Benchmark tests in `tests/benches/` use pytest-benchmark

## Validation Pattern

Typical usage pattern:

```python
import dataframely as dy
import polars as pl


class MySchema(dy.Schema):
    col = dy.String(nullable=False)

    @dy.rule()
    def my_rule(cls) -> pl.Expr:
        # Every value in "col" must be a non-empty string
        return pl.col("col").str.len_chars() > 0


# Example input; any DataFrame with a string column "col" works
df = pl.DataFrame({"col": ["a", "b"]})

# Validate and cast
validated_df: dy.DataFrame[MySchema] = MySchema.validate(df, cast=True)
```

## Key Configuration Files

- `pixi.toml`: Environment and task definitions
- `pyproject.toml`: Python package metadata, tool configurations (ruff, mypy, pytest)
- `Cargo.toml`: Rust dependencies and build settings
- `.pre-commit-config.yaml`: All code quality checks
- `rust-toolchain.toml`: Rust nightly version specification

## When Making Changes

1. **Python code**: Run `pixi run pre-commit run` before committing
2. **Rust code**: Run `pixi run postinstall` to rebuild, then run tests
3. **Tests**: Ensure `pixi run test` passes
4. **Documentation**: Update docstrings
5. **API changes**: Ensure backward compatibility or document migration path

## Performance Considerations

- Validation uses native polars expressions for performance
- Custom Rust plugin for advanced validation logic
- Lazy evaluation supported via `LazyFrame` for large datasets
- Avoid materializing data unnecessarily in validation rules
Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
name: Copilot Setup Steps
on:
  pull_request:
    paths:
      - .github/workflows/copilot-setup-steps.yml
  workflow_dispatch:

jobs:
  copilot-setup-steps:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      id-token: write
    steps:
      - name: Checkout branch
        uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5.0.0
      - name: Set up pixi
        uses: prefix-dev/setup-pixi@28eb668aafebd9dede9d97c4ba1cd9989a4d0004 # v0.9.2
        with:
          environments: default
      - name: Install Rust
        run: rustup show
      - name: Cache Rust dependencies
        uses: Swatinem/rust-cache@f13886b937689c021905a6b90929199931d60db1 # v2.8.1
      - name: Install repository
        run: pixi run postinstall
