Skip to content

Conversation

@dcherian
Copy link
Collaborator

@dcherian dcherian commented Jul 28, 2024

  • add group starts to kth
  • nantopk & topk; which is public?
  • edit CLAUDE.md
  • add numpy to topk definition

This is pretty ugly:

  1. Doesn't handle topk and nantopk yet
  2. Unlike quantile, topk can work as a tree-reduce algorithm. So we have to hack the k kwarg, passed in finalize_kwargs in many places. This isn't an issue for quantile, because that only works blockwise

pre-commit-ci bot and others added 4 commits July 28, 2024 20:16
* main:
  Fix first, last again (#381)
  Fix docs build (#382)
  Optimize for-loop merging of cohorts. (#378)
  Add cohorts snapshot tests with syrupy (#379)
  Stricter tolerance in property tests (#377)
  Better gridlines
  Update containment image with gridlines, higher dpi
@dcherian dcherian mentioned this pull request Aug 7, 2024
1 task
dcherian and others added 15 commits January 6, 2025 17:50
* main:
  Revert "Support first, last with datetime, timedelta (#402)" (#404)
  Support first, last with datetime, timedelta (#402)
  Bump codecov/codecov-action from 4.5.0 to 4.6.0 (#401)
  Bump mamba-org/setup-micromamba from 1 to 2 (#400)
  Revert "[revert] test with Xarray PR branch" (#393)
  [pre-commit.ci] pre-commit autoupdate (#399)
  Faster subsetting for cohorts (#397)
  Fix default int on windows, numpy<2 (#395)
  Avoid rechunking when preferred_method="blockwise" (#394)
  Preserve dtype better when specified. (#389)
  Drop python 3.9, use ruff (#392)
  silence warning (#390)
  Expand groupby_reduce property tests (#385)
  Fix bug with NaNs in `by` and method='blockwise' (#384)
  Avoid explicit np.nan, np.inf (#383)
* main:
  Optimize quantile. (#409)
dcherian and others added 30 commits July 16, 2025 16:25
For groups with fewer than k elements, clamp extraction indices to
stay within group boundaries instead of filling entire group with NaN.
This allows returning partial results (e.g., [NaN, val1, val2] for a
group with 2 values when k=3).

Changes:
- Compute group start/end positions from offset
- Clamp lo_ indices to [group_start, group_end) range
- Mark positions outside valid range with badmask
- Extract all indices in lo_, not just virtual_index

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Add nantopk function that replaces NaN with -inf/+inf before calling
topk. This is used as the combine function for topk aggregations to
ensure NaN values don't interfere with selecting top k elements across
chunks.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Changes to topk aggregation:
- Use nantopk as combine function instead of topk
- Add _topk_finalize to convert -inf fill values back to NaN
- Remove min_count enforcement (groups with <k elements return partial
  results instead of being entirely masked)

This enables topk to work correctly in dask's map-reduce pathway where
intermediate results from chunks need to be combined.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Aggregations with new_dims_func (quantile, topk) add an extra dimension
to intermediate results. These must use _simple_combine which reduces
along DUMMY_AXIS, not _grouped_combine which reduces along the groups
axis.

This ensures topk works correctly even when reindex=False and
labels_are_unknown=True.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
topk with reindex=False has issues with group alignment during the
combine step in map-reduce. Disable this combination and raise a clear
error message.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add test_topk_with_nan: Verifies NaNs are excluded for both k > 0 and k < 0
- Add test_topk_fewer_than_k_elements: Tests NaN padding when group size < |k|
- Add test_topk_all_nan_group: Tests all-NaN groups return all NaN results
- Add test_topk_fill_value_correctness: Tests missing groups get NaN fill value
- Remove resolved FIXME comment about fill_value (already handled in _initialize_aggregation)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add flox/_version.py to .gitignore (auto-generated by setuptools_scm)
- Add temporary files to .gitignore: *.rej, *.py.rej, Untitled.ipynb
- Add development/profiling files: mutmut-cache, profile.json, profile.html, mydask.png, test.png, uv.lock, devel/
- Remove all these files from git tracking

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Resolved conflicts by:
- Taking main's code organization (modularized into separate files)
- Adding topk support to the new structure:
  - Added topk_new_dims_func import to core.py
  - Added "topk" to _is_bool_supported_reduction in lib.py
  - Added "topk" to _choose_engine to force flox engine
  - Added topk validation for finalize_kwargs["k"]
  - Added topk to groupby_reduce docstring
  - Added Notes about topk limitations
  - Added reindex=False check for topk
  - Updated chunk_reduce to handle topk new_dims
  - Updated dask.py to handle topk:
    - Modified _expand_dims to skip first intermediate for topk
    - Added must_use_simple_combine logic for new_dims_func
    - Added topk kwargs passing to chunk_reduce

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Add type ignore comment for normalize_axis_tuple unpacking.
Mypy incorrectly infers the tuple length, but we know it's always
a 1-tuple since axis is a single integer in this context.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Convert single-element axis tuples to integers for numpy functions
like argmax/argmin that don't accept tuple axis parameters.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
This fixes the issue where Aggregation objects with different finalize_kwargs
(e.g., k=3 vs k=-3 for topk) were being cached and reused by dask because they
had the same token. Now finalize_kwargs is included in __dask_tokenize__.

Also fixed edge case where k might not be in finalize_kwargs yet.
The test was failing for topk operations with chunked arrays due to:

1. Incorrect axis transformation: Using swapaxes instead of moveaxis
   caused dimension reordering. Changed to moveaxis to only move the k
   dimension without affecting other dimensions.

2. Missing sorting in second test loop: The second assertion loop
   (testing different method/reindex combinations) was missing the
   sorting step for actual results.

Also update .gitignore to exclude .claude/, .mutmut-cache, and remove
these files from git tracking.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
The recent fix for single-element tuples in _simple_combine (commit
2ca52b1) converts single-element axis tuples to integers for numpy
functions. However, _var_combine expects axis to always be iterable.

Added a check at the start of _var_combine to ensure axis is always
a tuple, fixing TypeError when combining var/std aggregations with
single-element axis.

Fixes test_fill_value_behaviour[numbagg-std-123-3] and related tests.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
The topk branch added logic to force aggregations with new_dims_func to
use _simple_combine. However, argmax/argmin have new_dims_func set (even
though it returns an empty tuple), which caused them to incorrectly use
_simple_combine instead of _grouped_combine, resulting in off-by-one
errors in the returned indices.

Fix: Check if new_dims_func actually returns non-empty dimensions before
forcing _simple_combine path.

Also update _var_combine to handle both integer and tuple axis parameters
by normalizing to tuple at the start, since _simple_combine now passes
axis_ as-is (which is always a tuple).

Fixes test_groupby_reduce_axis_subset_against_numpy for argmax/argmin
and test_fill_value_behaviour for std/var aggregations.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Add detailed documentation of:
- Map-reduce combine strategies (_simple_combine vs _grouped_combine)
- Aggregations with new dimensions (topk, quantile)
- topk-specific implementation details
- Axis parameter handling conventions
- Test organization and expectations
- Common pitfalls to avoid

This knowledge was gained from fixing test_groupby_reduce_all and
test_groupby_reduce_axis_subset_against_numpy.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Restore full parameter matrix for test_groupby_reduce_all:
- size: [(1, 12), (12,), (12, 9)]
- nby: [1, 2, 3]
- add_nan_by: [True, False]
- func: ALL_FUNCS

This ensures comprehensive testing across all aggregations.
All 6228 tests pass.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
This reverts commit 73ba70c.
The previous logic `bool(agg.new_dims_func) and bool(agg.new_dims_func(**agg.finalize_kwargs))`
was redundant and inefficient. More importantly, it duplicated the logic that's
already encapsulated in the `agg.num_new_vector_dims` cached property.

Changed to use `agg.num_new_vector_dims > 0` which:
- Directly checks if the aggregation actually adds vector dimensions
- Leverages the existing cached property for efficiency
- Is cleaner and more maintainable

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Fix xarray.py to handle topk's new k dimension in apply_ufunc
- Add comprehensive test suite for topk with xarray
- Tests cover basic functionality, negative k, and dask arrays

The bug was that xarray_reduce only handled quantile's new dimension,
not topk's k dimension, causing dimension mismatch errors.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Move the topk tests from the separate test_xarray_topk.py file
into the main test_xarray.py file to keep xarray tests together.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants