-
Notifications
You must be signed in to change notification settings - Fork 21
Add topk #374
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draft
dcherian
wants to merge
88
commits into
main
Choose a base branch
from
topk
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
for more information, see https://pre-commit.ci
* main: Revert "Support first, last with datetime, timedelta (#402)" (#404) Support first, last with datetime, timedelta (#402) Bump codecov/codecov-action from 4.5.0 to 4.6.0 (#401) Bump mamba-org/setup-micromamba from 1 to 2 (#400) Revert "[revert] test with Xarray PR branch" (#393) [pre-commit.ci] pre-commit autoupdate (#399) Faster subsetting for cohorts (#397) Fix default int on windows, numpy<2 (#395) Avoid rechunking when preferred_method="blockwise" (#394) Preserve dtype better when specified. (#389) Drop python 3.9, use ruff (#392) silence warning (#390) Expand groupby_reduce property tests (#385) Fix bug with NaNs in `by` and method='blockwise' (#384) Avoid explicit np.nan, np.inf (#383)
* main: Optimize quantile. (#409)
dcherian
commented
Jan 8, 2025
For groups with fewer than k elements, clamp extraction indices to stay within group boundaries instead of filling entire group with NaN. This allows returning partial results (e.g., [NaN, val1, val2] for a group with 2 values when k=3). Changes: - Compute group start/end positions from offset - Clamp lo_ indices to [group_start, group_end) range - Mark positions outside valid range with badmask - Extract all indices in lo_, not just virtual_index 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Add nantopk function that replaces NaN with -inf/+inf before calling topk. This is used as the combine function for topk aggregations to ensure NaN values don't interfere with selecting top k elements across chunks. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Changes to topk aggregation: - Use nantopk as combine function instead of topk - Add _topk_finalize to convert -inf fill values back to NaN - Remove min_count enforcement (groups with <k elements return partial results instead of being entirely masked) This enables topk to work correctly in dask's map-reduce pathway where intermediate results from chunks need to be combined. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Aggregations with new_dims_func (quantile, topk) add an extra dimension to intermediate results. These must use _simple_combine which reduces along DUMMY_AXIS, not _grouped_combine which reduces along the groups axis. This ensures topk works correctly even when reindex=False and labels_are_unknown=True. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
topk with reindex=False has issues with group alignment during the combine step in map-reduce. Disable this combination and raise a clear error message. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
- Add test_topk_with_nan: Verifies NaNs are excluded for both k > 0 and k < 0 - Add test_topk_fewer_than_k_elements: Tests NaN padding when group size < |k| - Add test_topk_all_nan_group: Tests all-NaN groups return all NaN results - Add test_topk_fill_value_correctness: Tests missing groups get NaN fill value - Remove resolved FIXME comment about fill_value (already handled in _initialize_aggregation) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
- Add flox/_version.py to .gitignore (auto-generated by setuptools_scm) - Add temporary files to .gitignore: *.rej, *.py.rej, Untitled.ipynb - Add development/profiling files: mutmut-cache, profile.json, profile.html, mydask.png, test.png, uv.lock, devel/ - Remove all these files from git tracking 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Resolved conflicts by:
- Taking main's code organization (modularized into separate files)
- Adding topk support to the new structure:
- Added topk_new_dims_func import to core.py
- Added "topk" to _is_bool_supported_reduction in lib.py
- Added "topk" to _choose_engine to force flox engine
- Added topk validation for finalize_kwargs["k"]
- Added topk to groupby_reduce docstring
- Added Notes about topk limitations
- Added reindex=False check for topk
- Updated chunk_reduce to handle topk new_dims
- Updated dask.py to handle topk:
- Modified _expand_dims to skip first intermediate for topk
- Added must_use_simple_combine logic for new_dims_func
- Added topk kwargs passing to chunk_reduce
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Add type ignore comment for normalize_axis_tuple unpacking. Mypy incorrectly infers the tuple length, but we know it's always a 1-tuple since axis is a single integer in this context. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Convert single-element axis tuples to integers for numpy functions like argmax/argmin that don't accept tuple axis parameters. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
This fixes the issue where Aggregation objects with different finalize_kwargs (e.g., k=3 vs k=-3 for topk) were being cached and reused by dask because they had the same token. Now finalize_kwargs is included in __dask_tokenize__. Also fixed edge case where k might not be in finalize_kwargs yet.
The test was failing for topk operations with chunked arrays due to: 1. Incorrect axis transformation: Using swapaxes instead of moveaxis caused dimension reordering. Changed to moveaxis to only move the k dimension without affecting other dimensions. 2. Missing sorting in second test loop: The second assertion loop (testing different method/reindex combinations) was missing the sorting step for actual results. Also update .gitignore to exclude .claude/, .mutmut-cache, and remove these files from git tracking. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
The recent fix for single-element tuples in _simple_combine (commit 2ca52b1) converts single-element axis tuples to integers for numpy functions. However, _var_combine expects axis to always be iterable. Added a check at the start of _var_combine to ensure axis is always a tuple, fixing TypeError when combining var/std aggregations with single-element axis. Fixes test_fill_value_behaviour[numbagg-std-123-3] and related tests. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
The topk branch added logic to force aggregations with new_dims_func to use _simple_combine. However, argmax/argmin have new_dims_func set (even though it returns an empty tuple), which caused them to incorrectly use _simple_combine instead of _grouped_combine, resulting in off-by-one errors in the returned indices. Fix: Check if new_dims_func actually returns non-empty dimensions before forcing _simple_combine path. Also update _var_combine to handle both integer and tuple axis parameters by normalizing to tuple at the start, since _simple_combine now passes axis_ as-is (which is always a tuple). Fixes test_groupby_reduce_axis_subset_against_numpy for argmax/argmin and test_fill_value_behaviour for std/var aggregations. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Add detailed documentation of: - Map-reduce combine strategies (_simple_combine vs _grouped_combine) - Aggregations with new dimensions (topk, quantile) - topk-specific implementation details - Axis parameter handling conventions - Test organization and expectations - Common pitfalls to avoid This knowledge was gained from fixing test_groupby_reduce_all and test_groupby_reduce_axis_subset_against_numpy. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Restore full parameter matrix for test_groupby_reduce_all: - size: [(1, 12), (12,), (12, 9)] - nby: [1, 2, 3] - add_nan_by: [True, False] - func: ALL_FUNCS This ensures comprehensive testing across all aggregations. All 6228 tests pass. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
This reverts commit 73ba70c.
The previous logic `bool(agg.new_dims_func) and bool(agg.new_dims_func(**agg.finalize_kwargs))` was redundant and inefficient. More importantly, it duplicated the logic that's already encapsulated in the `agg.num_new_vector_dims` cached property. Changed to use `agg.num_new_vector_dims > 0` which: - Directly checks if the aggregation actually adds vector dimensions - Leverages the existing cached property for efficiency - Is cleaner and more maintainable 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
- Fix xarray.py to handle topk's new k dimension in apply_ufunc - Add comprehensive test suite for topk with xarray - Tests cover basic functionality, negative k, and dask arrays The bug was that xarray_reduce only handled quantile's new dimension, not topk's k dimension, causing dimension mismatch errors. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Move the topk tests from the separate test_xarray_topk.py file into the main test_xarray.py file to keep xarray tests together. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
kthThis is pretty ugly:
topkandnantopkyetkkwarg, passed infinalize_kwargsin many places. This isn't an issue for quantile, because that only works blockwise