-
Notifications
You must be signed in to change notification settings - Fork 133
feat: reduce duplicate fields on join #1184
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
kosiew
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few suggestions.
Also, it would be good to update docs/source/.../joins.rst to mention keep_duplicate_keys
| assert result.column(0) == pa.array([1, 2, 3, 4, 5, 6]) | ||
| assert result.column(1) == pa.array([7, 8, 8, 9, 9, 9]) | ||
|
|
||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good if new tests for keep_duplicate_keys=False or True were added. To ensure coverage, add tests verifying that passing keep_duplicate_keys=True preserves both columns.
python/datafusion/dataframe.py
Outdated
| left_on: str | Sequence[str] | None = None, | ||
| right_on: str | Sequence[str] | None = None, | ||
| join_keys: tuple[list[str], list[str]] | None = None, | ||
| keep_duplicate_keys: bool = False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name keep_duplicate_keys is somewhat confusing: it drops the right-side keys when False.
A more direct name like drop_duplicate_keys: bool = False (default) or deduplicate: bool = False may better express intent.
Additionally, in many DataFrame libraries (Pandas, PySpark), the term suffixes or indicator is used for duplicate‐column handling. Consider whether a suffix‐based approach (with default ('', '_right')) could be more familiar to users than a boolean drop flag.
8e1ed67 to
1e7f874
Compare
kosiew
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Co-authored-by: kosiew <kosiew@gmail.com>
Which issue does this PR close?
Closes #1173
Rationale for this change
In the current version of the code when you do a join and there is a common
oncolumn name, then you end up with two columns in the output dataframe with ambiguous names. This is an annoyance for users where they have to work around by renaming the column to join on. With this change, it makes the interface more user friendly.What changes are included in this PR?
keep_duplicate_keysif the user does not want to drop duplicate column namesonand notjoin_onAre there any user-facing changes?
Yes.
DataFrame.join()by default will now only return a single column for duplicateonkeys. The user can revert to the previous version by settingkeep_duplicate_keystoTrue.