
Conversation

@AsymmetryChou (Collaborator) commented Nov 26, 2025

For issue #285, I fix potential problems in AtomicData.py when treating isolated systems with self_interaction=False:

  1. Convert the mask to a numpy array for consistent indexing.
  2. Ensure that all output arrays are proper numpy arrays to avoid compatibility issues in isolated systems (see the sketch below).
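A minimal sketch of the pattern these two points describe, using a made-up standalone neighbor list (the variable names mirror AtomicData.py, but the surrounding logic is simplified and this is not the actual neighbor_list_and_relative_vec code):

import numpy as np
import torch

first_idex = np.array([0, 0, 1])          # neighbor-list source indices (numpy)
second_idex = np.array([0, 1, 1])         # neighbor-list target indices (numpy)
shifts = np.zeros((3, 3))                 # periodic shift vectors (numpy)
mask = torch.tensor([False, True, True])  # torch.bool mask, e.g. after dropping i == j edges

# 1. Convert the torch mask to numpy so it indexes the numpy arrays consistently.
mask_np = mask.numpy()

# 2. Keep the masked selections as proper numpy arrays, then convert to torch tensors.
first_idex = torch.as_tensor(first_idex[mask_np], dtype=torch.long)
second_idex = torch.as_tensor(second_idex[mask_np], dtype=torch.long)
shifts = torch.as_tensor(shifts[mask_np], dtype=torch.get_default_dtype())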

Summary by CodeRabbit

  • Refactor
    • Improved internal data handling for atomic systems: more robust indexing and reshaping for isolated/self-interaction cases, and safer device-aware conversions to ensure compatibility and correctness across edge cases.


coderabbitai bot commented Nov 26, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Walkthrough

Reworked the i == j branch in neighbor_list_and_relative_vec to use NumPy-based masking: convert the boolean mask to NumPy, extract indices and shifts with np.atleast_1d/np.atleast_2d for isolated cases, then rebuild per-edge tensors via NumPy selections and convert back to torch tensors.

Changes

  • Indexing logic refactor (dptb/data/AtomicData.py): Replaced torch boolean-mask indexing for the self-interaction branch with NumPy-based masking. Converts mask to mask_np, extracts o_first_idex, o_second_idex, and o_shift using np.atleast_1d/np.atleast_2d to handle isolated systems, then reconstructs first_idex, second_idex, and shifts via NumPy selections and converts results back to torch tensors. Added comments explaining the NumPy-indexing rationale and preserved existing self-interaction reduction/expansion logic.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Attention points:
    • Verify behavior for empty masks, single-element masks, and batched/isolated systems.
    • Ensure shapes and dtypes are preserved across NumPy ↔ torch conversions (see the sketch after this list).
    • Check performance implications on hot paths and device transfers (CPU ↔ GPU).
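A quick self-check for the second attention point, shape and dtype preservation across the NumPy ↔ torch round trip, using a made-up single-edge (isolated-atom) selection:

import numpy as np
import torch

out_dtype = torch.get_default_dtype()
shifts_np = np.zeros((1, 3))        # one i == j edge, as an isolated atom would produce
mask_np = np.array([True])

shifts_t = torch.as_tensor(shifts_np[mask_np], dtype=out_dtype)
assert shifts_t.shape == (1, 3)     # shape preserved, not squeezed to (3,)
assert shifts_t.dtype == out_dtype  # dtype follows out_dtype, not numpy's float64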

Pre-merge checks

✅ Passed checks (3 passed)
  • Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title accurately describes the main change: fixing mask handling in AtomicData.py and ensuring proper numpy array handling for isolated systems.
  • Docstring Coverage ✅ Passed: Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0eb5c52 and 0e074be.

📒 Files selected for processing (1)
  • dptb/data/AtomicData.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (2)
dptb/data/AtomicData.py (2)

1016-1027: LGTM! Past review comment addressed.

The conversion to numpy-based indexing and the use of np.atleast_1d/np.atleast_2d correctly handle isolated systems where there may be only a single i==j edge. Specifically, np.atleast_2d(o_shift) at line 1026 ensures the shape is (1, 3) rather than (3,), which prevents indexing errors in the loop at lines 1030-1043 where o_shift[i] expects to access a row vector.
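A tiny illustration of the shape guard (made-up values; how the (3,) shape arises in the real code is described above):

import numpy as np

o_shift = np.zeros(3)                 # one shift vector that has collapsed to shape (3,)
print(np.atleast_1d(o_shift).shape)   # (3,)   -> o_shift[i] would return scalar components
print(np.atleast_2d(o_shift).shape)   # (1, 3) -> o_shift[0] is the full 3-component row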


1052-1056: LGTM! Consistent numpy-based indexing pattern.

The second conversion to numpy (line 1053) is necessary since mask was modified at line 1049. Using mask_np to index the numpy arrays before converting to torch tensors on the target device ensures consistent behavior and avoids compatibility issues with isolated systems.



coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
dptb/data/AtomicData.py (2)

1013-1027: Numpy-based masking for the i == j case looks correct; consider eliminating remaining torch/numpy mixing

The new mask_np = mask.numpy() path and using it to index first_idex, second_idex, and shifts, plus np.atleast_1d(...), should fix the earlier incompatibility when indexing numpy arrays with a torch.bool mask and protect against scalar/0‑D returns in edge cases. The loop over o_first_idex / o_shift keeps the previous behavior.

One minor clean‑up you might consider later: mask[first_idex == second_idex] = False still indexes a torch tensor with a numpy boolean array. It works today but slightly undermines the goal of avoiding cross‑framework indexing. You could instead keep the whole i == j masking logic in numpy (e.g., have mask as a numpy bool_ array from the start, only converting to torch right before building tensors).
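A hedged sketch of that alternative, keeping the whole i == j masking in numpy and converting to torch only at the end (illustrative names and values; the self-interaction re-expansion logic that the real branch preserves is omitted here):

import numpy as np
import torch

first_idex = np.array([0, 0, 1])
second_idex = np.array([0, 1, 1])
shifts = np.zeros((3, 3))

# The mask is a numpy bool array from the start, so no torch tensor is ever
# indexed with a numpy array (or vice versa) inside the i == j handling.
mask = np.ones(len(first_idex), dtype=bool)
mask[first_idex == second_idex] = False

first_idex = torch.as_tensor(first_idex[mask], dtype=torch.long)
second_idex = torch.as_tensor(second_idex[mask], dtype=torch.long)
shifts = torch.as_tensor(shifts[mask], dtype=torch.get_default_dtype())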


1052-1056: Avoid double conversion and keep edge_index on the same device as pos

Using mask_np = mask.numpy() and then

first_idex = torch.as_tensor(first_idex[mask_np], dtype=torch.long, device=out_device)
second_idex = torch.as_tensor(second_idex[mask_np], dtype=torch.long, device=out_device)
shifts = torch.as_tensor(shifts[mask_np], dtype=out_dtype, device=out_device)

is aligned with the goal of returning proper tensors on out_device. However, edge_index is then built with

edge_index = torch.vstack(
    (torch.LongTensor(first_idex), torch.LongTensor(second_idex))
)

which forces edge_index back to CPU and incurs an extra copy if out_device is not CPU, while shifts and cell_tensor stay on out_device. This also contradicts the docstring claim that all outputs share the same device as pos.

You could instead reuse the already‑typed tensors and preserve the device:

edge_index = torch.stack((first_idex, second_idex), dim=0)

and, if you adopt this, you no longer need the torch.LongTensor(...) conversions.

Also applies to: 1064-1066
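A small demonstration of the device behavior this suggestion targets (hypothetical tensors; out_device is set to CPU here so the snippet runs anywhere, but the same holds for a CUDA device):

import torch

out_device = torch.device("cpu")   # stands in for the function's out_device
first_idex = torch.as_tensor([0, 1, 2], dtype=torch.long, device=out_device)
second_idex = torch.as_tensor([1, 2, 0], dtype=torch.long, device=out_device)

# Stacking the already-typed tensors keeps dtype and device untouched, unlike the
# legacy torch.LongTensor(...) constructor, which (per the comment above) would
# rebuild them on CPU.
edge_index = torch.stack((first_idex, second_idex), dim=0)
assert edge_index.shape == (2, 3)
assert edge_index.dtype == torch.long and edge_index.device == out_device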

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bd6677a and 6c05a71.

📒 Files selected for processing (1)
  • dptb/data/AtomicData.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
dptb/data/AtomicData.py (2)
dptb/data/dataset/lmdb_dataset.py (1)
  • len (79-80)
dptb/data/dataset/_base_datasets.py (1)
  • len (170-173)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build

Copilot finished reviewing on behalf of AsymmetryChou November 26, 2025 03:29
Copilot AI (Contributor) left a comment

Pull request overview

This PR fixes mask handling issues in the neighbor_list_and_relative_vec function in AtomicData.py, particularly for isolated systems when self_interaction=False. The changes address compatibility issues between PyTorch and NumPy array indexing, and also fix a bug where torch.LongTensor() was incorrectly called with a device parameter.

Key changes:

  • Convert torch boolean masks to numpy arrays before indexing numpy arrays to avoid compatibility issues
  • Add np.atleast_1d() calls to ensure arrays are not scalars in edge cases with isolated systems
  • Replace buggy torch.LongTensor(..., device=...) calls with torch.as_tensor(..., dtype=torch.long, device=...) (see the sketch after this list)
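A hedged sketch of the third key change (illustrative values; the overview above notes that the legacy torch.LongTensor constructor was being passed a device argument it does not support):

import torch

out_device = torch.device("cpu")   # illustrative; the real code targets out_device
idx = [0, 1, 1]

# Buggy pattern flagged above: torch.LongTensor(idx, device=out_device)
# torch.as_tensor controls both dtype and device in one documented call:
first_idex = torch.as_tensor(idx, dtype=torch.long, device=out_device)
assert first_idex.dtype == torch.long and first_idex.device == out_device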


# Ensure arrays are proper numpy arrays (not scalars) for isolated systems
o_first_idex = np.atleast_1d(o_first_idex)
o_second_idex = np.atleast_1d(o_second_idex)
o_shift = np.atleast_1d(o_shift)
Copilot AI commented Nov 26, 2025

Using np.atleast_1d on o_shift is incorrect. The shifts array from ASE has shape (n, 3) representing shift vectors. When indexing with ~mask_np results in a single match, o_shift would have shape (3,) representing one vector. Applying np.atleast_1d keeps it as (3,), which breaks the loop logic on lines 1031 and 1042 where o_shift[i] expects to index into the first dimension to get a vector, but would instead get a scalar. Use o_shift = np.atleast_2d(o_shift) or ensure the shape is preserved as (1, 3) for isolated systems.

Suggested change:
-    o_shift = np.atleast_1d(o_shift)
+    o_shift = np.atleast_2d(o_shift)

coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
dptb/data/AtomicData.py (1)

1013-1044: Self-edge extraction and np.atleast_2d usage look correct; consider making .numpy() device-agnostic

The updated flow—using mask_np for numpy indexing and np.atleast_2d(o_shift)—correctly avoids torch/numpy masking issues and ensures o_shift[i] is always a 3‑component vector even when there's only a single i == j edge, which fixes the earlier shape bug in the loop.

To make this more robust in environments that set a non‑CPU default device (e.g. via torch.set_default_device), it would be safer to move mask explicitly to CPU before converting to numpy:

-    mask_np = mask.numpy()
+    mask_np = mask.cpu().numpy()

and likewise for the later occurrence before re-masking at line 1053. This avoids accidental failures if mask ever ends up on a CUDA device.
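A minimal illustration of the suggested guard; the CUDA branch only runs when a GPU is actually available, so the sketch executes anywhere:

import torch

mask = torch.tensor([True, False, True])
if torch.cuda.is_available():
    mask = mask.cuda()

# Tensor.numpy() raises on CUDA tensors; calling .cpu() first makes the
# conversion device-agnostic.
mask_np = mask.cpu().numpy()
print(mask_np.dtype, mask_np.shape)  # bool (3,)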

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6c05a71 and 0eb5c52.

📒 Files selected for processing (1)
  • dptb/data/AtomicData.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (1)
dptb/data/AtomicData.py (1)

1052-1056: Final tensor reconstruction from NumPy selections is consistent with original API

Rebuilding first_idex, second_idex, and shifts via mask_np and then converting them back to torch tensors on out_device with the appropriate dtypes preserves the original function contract while fixing the mixed torch/numpy masking issues. This segment looks good to me.

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.



@floatingCatty floatingCatty merged commit fd65c44 into deepmodeling:main Nov 26, 2025
10 checks passed