-
-
Notifications
You must be signed in to change notification settings - Fork 19.3k
String dtype: fix isin() values handling for python storage #59759
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
String dtype: fix isin() values handling for python storage #59759
Conversation
| if not lib.is_string_array(np.asarray(values), skipna=True): | ||
| values = np.array( | ||
| [val for val in values if isinstance(val, str) or isna(val)], | ||
| dtype=object, | ||
| ) | ||
| if not len(values): | ||
| return np.zeros(self.shape, dtype=bool) | ||
|
|
||
| values = self._from_sequence(values, dtype=self.dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This ensures that we convert the input values to a StringArray first, to make use of the construction validation (coercing to one missing value sentinel).
But, this is only done if we actually get strings passed, and if not, we filter out non-strings. This is also what is done in the ArrowStringArray implementation, and this is to ensure we accept something like isin(["string", 10]) with mixed-type values (I do wonder if we should deprecate that in general, but that's for a later, separate discussion).
|
|
||
| def isin(self, values: ArrayLike) -> npt.NDArray[np.bool_]: | ||
| if not isinstance(values, BaseStringArray): | ||
| if not lib.is_string_array(np.asarray(values), skipna=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth short-circuiting for the case where values is all NA and self._hasna is False?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, is it worth short-circuiting this check when the values.dtype is ArrowDtype(pa.string())?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth short-circuiting for the case where
valuesis allNAandself._hasnaisFalse?
I would say that is probably not worth the added complexity for something that is just speeding up an uncommon corner case? (also, checking self._hasna has a cost)
Also, is it worth short-circuiting this check when the
values.dtypeisArrowDtype(pa.string())?
I wanted to do that, but then I would need to import ArrowDtype and pyarrow inline (because this file is for the object string dtype, that works without pyarrow), so at that point I was again unsure if we should add a special case here
(I think a typical use case is passing a list, and that will at the moment never be coerced to ArrowDtype before it gets here)
Now, what we should maybe do to essentially speed up this case is to ensure lib.is_string_array shortcuts on pyarrow strings and directly returns True, and ensure that StringArray._from_sequence shortcuts on pyarrow strings to ensure the most efficient conversion (if it doesn't do that already).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although I can probably check for ArrowDtype(pa.string()) without importing it by comparing for equality with "string[pyarrow]"
|
Thanks @jorisvandenbossche |
…ev#59759) * String dtype: fix isin() values handling for python storage * address feedback
* String dtype: fix isin() values handling for python storage * address feedback
This PR adds a
StringArray.isin()implementation (ArrowStringArrayalready has a customisin()), so we can add custom processing of the passedvaluesset compared to the base class implementation.See also the discussion at #58451 (comment) about the expected behaviour.
xref #54792