Skip to content
Open
Show file tree
Hide file tree
Changes from 6 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v3.0.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -1261,6 +1261,7 @@ Other
- Bug in printing a :class:`DataFrame` with a :class:`DataFrame` stored in :attr:`DataFrame.attrs` raised a ``ValueError`` (:issue:`60455`)
- Bug in printing a :class:`Series` with a :class:`DataFrame` stored in :attr:`Series.attrs` raised a ``ValueError`` (:issue:`60568`)
- Deprecated the keyword ``check_datetimelike_compat`` in :meth:`testing.assert_frame_equal` and :meth:`testing.assert_series_equal` (:issue:`55638`)
- Fixed bug in :meth:`DataFrame.combine` with non-unique columns (:issue:`51340`)
- Fixed bug in :meth:`Series.replace` and :meth:`DataFrame.replace` when trying to replace :class:`NA` values in a :class:`Float64Dtype` object with ``np.nan``; this now works with ``pd.set_option("mode.nan_is_na", False)`` and is irrelevant otherwise (:issue:`55127`)
- Fixed bug in :meth:`Series.replace` and :meth:`DataFrame.replace` when trying to replace :class:`np.nan` values in a :class:`Int64Dtype` object with :class:`NA`; this is now a no-op with ``pd.set_option("mode.nan_is_na", False)`` and is irrelevant otherwise (:issue:`51237`)
- Fixed bug in the :meth:`Series.rank` with object dtype and extremely small float values (:issue:`62036`)
Expand Down
30 changes: 25 additions & 5 deletions pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -194,7 +194,10 @@
nargsort,
)

from pandas.io.common import get_handle
from pandas.io.common import (
dedup_names,
get_handle,
)
from pandas.io.formats import (
console,
format as fmt,
Expand Down Expand Up @@ -9093,7 +9096,7 @@ def combine(
2 NaN 3.0 1.0
"""
other_idxlen = len(other.index) # save for compare
other_columns = other.columns
self_columns, other_columns = self.columns, other.columns

this, other = self.align(other)
new_index = this.index
Expand All @@ -9104,11 +9107,26 @@ def combine(
if self.empty and len(other) == other_idxlen:
return other.copy()

new_columns_out = self.columns.union(other_columns, sort=False)
# Deduplicate column names if necessary
self_columns = Index(
dedup_names(list(self_columns), False), dtype=self_columns.dtype
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why would dedup_names be necessay here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having unique column names would preserve the original logic of using column names. Switching to indices would require multiple indices.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh?

)
other_columns = Index(
dedup_names(list(other_columns), False), dtype=other_columns.dtype
)
this.columns = Index(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why alter this.columns?

dedup_names(list(this.columns), False), dtype=this.columns.dtype
)
other.columns = Index(
dedup_names(list(other.columns), False), dtype=other.columns.dtype
)

# preserve column order
new_columns = self.columns.union(other_columns, sort=False)
new_columns_unique = self_columns.union(other_columns, sort=False)
do_fill = fill_value is not None
result = {}
for col in new_columns:
for col in new_columns_unique:
series = this[col]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get rid of all the dedup_names stuff above and just iterate over range(this.shape[1]) and use series = this.iloc[:, i], other_series = other.iloc[:, i]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There has to be other fixes because the logic heavily relied on column names instead of indices. I think this, other and new_columns(which result uses) would each need to have their own indices.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the logic heavily relied on column names instead of indices

how so?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the docstring of the function there is an example in which one dataframe has columns A,B while other has B,C. In that case it would be tricky to use index instead of column name.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On L9100 we call self.align(other). After that, the columns are aligned.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand what align is doing now after seeing the docsting example with different column names.

other_series = other[col]

Expand Down Expand Up @@ -9158,7 +9176,9 @@ def combine(
result[col] = arr

# convert_objects just in case
frame_result = self._constructor(result, index=new_index, columns=new_columns)
frame_result = self._constructor(
result, index=new_index, columns=new_columns_out
)
return frame_result.__finalize__(self, method="combine")

def combine_first(self, other: DataFrame) -> DataFrame:
Expand Down
14 changes: 14 additions & 0 deletions pandas/tests/frame/methods/test_combine.py
Original file line number Diff line number Diff line change
Expand Up @@ -45,3 +45,17 @@ def test_combine_generic(self, float_frame):
)
tm.assert_frame_equal(chunk, exp)
tm.assert_frame_equal(chunk2, exp)

def test_combine_non_unique_columns(self):
# GH#51340
df = pd.DataFrame({"A": range(5), "B": range(5)})
df.columns = ["A", "A"]

other = df.copy()
df.iloc[1, :] = 11

def combiner(a, b):
return b

result = df.combine(other, combiner)
tm.assert_frame_equal(result, other)
Loading