Add topk #374

dcherian · 2024-07-28T04:44:52Z

add group starts to kth
nantopk & topk; which is public?
edit CLAUDE.md
add numpy to topk definition

This is pretty ugly:

Doesn't handle topk and nantopk yet
Unlike quantile, topk can work as a tree-reduce algorithm. So we have to hack the k kwarg, passed in finalize_kwargs in many places. This isn't an issue for quantile, because that only works blockwise

for more information, see https://pre-commit.ci

* main: Fix first, last again (#381) Fix docs build (#382) Optimize for-loop merging of cohorts. (#378) Add cohorts snapshot tests with syrupy (#379) Stricter tolerance in property tests (#377) Better gridlines Update containment image with gridlines, higher dpi

* main: Revert "Support first, last with datetime, timedelta (#402)" (#404) Support first, last with datetime, timedelta (#402) Bump codecov/codecov-action from 4.5.0 to 4.6.0 (#401) Bump mamba-org/setup-micromamba from 1 to 2 (#400) Revert "[revert] test with Xarray PR branch" (#393) [pre-commit.ci] pre-commit autoupdate (#399) Faster subsetting for cohorts (#397) Fix default int on windows, numpy<2 (#395) Avoid rechunking when preferred_method="blockwise" (#394) Preserve dtype better when specified. (#389) Drop python 3.9, use ruff (#392) silence warning (#390) Expand groupby_reduce property tests (#385) Fix bug with NaNs in `by` and method='blockwise' (#384) Avoid explicit np.nan, np.inf (#383)

* main: Optimize quantile. (#409)

tests/test_properties.py

For groups with fewer than k elements, clamp extraction indices to stay within group boundaries instead of filling entire group with NaN. This allows returning partial results (e.g., [NaN, val1, val2] for a group with 2 values when k=3). Changes: - Compute group start/end positions from offset - Clamp lo_ indices to [group_start, group_end) range - Mark positions outside valid range with badmask - Extract all indices in lo_, not just virtual_index 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

Add nantopk function that replaces NaN with -inf/+inf before calling topk. This is used as the combine function for topk aggregations to ensure NaN values don't interfere with selecting top k elements across chunks. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

Changes to topk aggregation: - Use nantopk as combine function instead of topk - Add _topk_finalize to convert -inf fill values back to NaN - Remove min_count enforcement (groups with <k elements return partial results instead of being entirely masked) This enables topk to work correctly in dask's map-reduce pathway where intermediate results from chunks need to be combined. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

Aggregations with new_dims_func (quantile, topk) add an extra dimension to intermediate results. These must use _simple_combine which reduces along DUMMY_AXIS, not _grouped_combine which reduces along the groups axis. This ensures topk works correctly even when reindex=False and labels_are_unknown=True. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

topk with reindex=False has issues with group alignment during the combine step in map-reduce. Disable this combination and raise a clear error message. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

- Add test_topk_with_nan: Verifies NaNs are excluded for both k > 0 and k < 0 - Add test_topk_fewer_than_k_elements: Tests NaN padding when group size < |k| - Add test_topk_all_nan_group: Tests all-NaN groups return all NaN results - Add test_topk_fill_value_correctness: Tests missing groups get NaN fill value - Remove resolved FIXME comment about fill_value (already handled in _initialize_aggregation) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

- Add flox/_version.py to .gitignore (auto-generated by setuptools_scm) - Add temporary files to .gitignore: *.rej, *.py.rej, Untitled.ipynb - Add development/profiling files: mutmut-cache, profile.json, profile.html, mydask.png, test.png, uv.lock, devel/ - Remove all these files from git tracking 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

Resolved conflicts by: - Taking main's code organization (modularized into separate files) - Adding topk support to the new structure: - Added topk_new_dims_func import to core.py - Added "topk" to _is_bool_supported_reduction in lib.py - Added "topk" to _choose_engine to force flox engine - Added topk validation for finalize_kwargs["k"] - Added topk to groupby_reduce docstring - Added Notes about topk limitations - Added reindex=False check for topk - Updated chunk_reduce to handle topk new_dims - Updated dask.py to handle topk: - Modified _expand_dims to skip first intermediate for topk - Added must_use_simple_combine logic for new_dims_func - Added topk kwargs passing to chunk_reduce 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

Add type ignore comment for normalize_axis_tuple unpacking. Mypy incorrectly infers the tuple length, but we know it's always a 1-tuple since axis is a single integer in this context. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

Convert single-element axis tuples to integers for numpy functions like argmax/argmin that don't accept tuple axis parameters. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

This fixes the issue where Aggregation objects with different finalize_kwargs (e.g., k=3 vs k=-3 for topk) were being cached and reused by dask because they had the same token. Now finalize_kwargs is included in __dask_tokenize__. Also fixed edge case where k might not be in finalize_kwargs yet.

The test was failing for topk operations with chunked arrays due to: 1. Incorrect axis transformation: Using swapaxes instead of moveaxis caused dimension reordering. Changed to moveaxis to only move the k dimension without affecting other dimensions. 2. Missing sorting in second test loop: The second assertion loop (testing different method/reindex combinations) was missing the sorting step for actual results. Also update .gitignore to exclude .claude/, .mutmut-cache, and remove these files from git tracking. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

The recent fix for single-element tuples in _simple_combine (commit 2ca52b1) converts single-element axis tuples to integers for numpy functions. However, _var_combine expects axis to always be iterable. Added a check at the start of _var_combine to ensure axis is always a tuple, fixing TypeError when combining var/std aggregations with single-element axis. Fixes test_fill_value_behaviour[numbagg-std-123-3] and related tests. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

The topk branch added logic to force aggregations with new_dims_func to use _simple_combine. However, argmax/argmin have new_dims_func set (even though it returns an empty tuple), which caused them to incorrectly use _simple_combine instead of _grouped_combine, resulting in off-by-one errors in the returned indices. Fix: Check if new_dims_func actually returns non-empty dimensions before forcing _simple_combine path. Also update _var_combine to handle both integer and tuple axis parameters by normalizing to tuple at the start, since _simple_combine now passes axis_ as-is (which is always a tuple). Fixes test_groupby_reduce_axis_subset_against_numpy for argmax/argmin and test_fill_value_behaviour for std/var aggregations. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

Add detailed documentation of: - Map-reduce combine strategies (_simple_combine vs _grouped_combine) - Aggregations with new dimensions (topk, quantile) - topk-specific implementation details - Axis parameter handling conventions - Test organization and expectations - Common pitfalls to avoid This knowledge was gained from fixing test_groupby_reduce_all and test_groupby_reduce_axis_subset_against_numpy. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

Restore full parameter matrix for test_groupby_reduce_all: - size: [(1, 12), (12,), (12, 9)] - nby: [1, 2, 3] - add_nan_by: [True, False] - func: ALL_FUNCS This ensures comprehensive testing across all aggregations. All 6228 tests pass. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

This reverts commit 73ba70c.

The previous logic `bool(agg.new_dims_func) and bool(agg.new_dims_func(**agg.finalize_kwargs))` was redundant and inefficient. More importantly, it duplicated the logic that's already encapsulated in the `agg.num_new_vector_dims` cached property. Changed to use `agg.num_new_vector_dims > 0` which: - Directly checks if the aggregation actually adds vector dimensions - Leverages the existing cached property for efficiency - Is cleaner and more maintainable 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

- Fix xarray.py to handle topk's new k dimension in apply_ufunc - Add comprehensive test suite for topk with xarray - Tests cover basic functionality, negative k, and dask arrays The bug was that xarray_reduce only handled quantile's new dimension, not topk's k dimension, causing dimension mismatch errors. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

Move the topk tests from the separate test_xarray_topk.py file into the main test_xarray.py file to keep xarray tests together. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

dcherian force-pushed the topk branch from d23cd79 to e89a226 Compare July 28, 2024 14:25

dcherian added 6 commits July 28, 2024 14:16

Add topk

650088b

Negative k

889be0c

dask support

996ff2a

test

776d233

wip

a5eb7b9

fix

4fa9a4c

dcherian force-pushed the topk branch from 7082c55 to 4fa9a4c Compare July 28, 2024 20:16

pre-commit-ci bot and others added 4 commits July 28, 2024 20:16

[pre-commit.ci] auto fixes from pre-commit.com hooks

4b04fde

for more information, see https://pre-commit.ci

Handle dtypes.NA properly for datetime/timedelta

93800aa

Fix

80c67f4

dcherian mentioned this pull request Aug 7, 2024

Avoid explicit np.nan, np.inf #383

Merged

1 task

dcherian and others added 15 commits January 6, 2025 17:50

Fixes

c924017

one more fix

7a794ba

fix

eec4dd4

one more fix

6ac9a1f

Fixes.

83594e8

WIP

740f85f

Merge branch 'main' into topk

5d64fd9

* main: Optimize quantile. (#409)

fixes

e177efd

fix

9393470

cleanup

17eb915

works?

dc0df3e

fix quantile

83ae5d8

optimize xrutils.topk

95d20b8

Merge branch 'main' into topk

0b9fafc

dcherian commented Jan 8, 2025

View reviewed changes

tests/test_properties.py Outdated Show resolved Hide resolved

Update tests/test_properties.py

caa98b8

dcherian and others added 30 commits July 16, 2025 16:25

fix

0f7ee05

Remove docs/oisst.ipynb from git tracking

f61c856

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

move _DUMMY_AXIS to dask.py

cbf9596

Avoid squeezing

bacda97

Properly handle empty input

8d1606d

Fix combining

35a8a22

FIx test

15aa174

revert a change

a863f5b

tweak

73ba70c

Remove Claude settings files from git tracking

16bdec3

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

Revert "tweak"

5785714

This reverts commit 73ba70c.

Move topk tests to test_xarray.py

c640f2e

Move the topk tests from the separate test_xarray_topk.py file into the main test_xarray.py file to keep xarray tests together. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add topk #374

Add topk #374

Uh oh!

dcherian commented Jul 28, 2024 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Add topk #374

Are you sure you want to change the base?

Add topk #374

Uh oh!

Conversation

dcherian commented Jul 28, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

dcherian commented Jul 28, 2024 •

edited

Loading