ENH: add basic DataFrame.from_arrow class method for importing through Arrow PyCapsule interface #59696
Open · jorisvandenbossche wants to merge 7 commits into pandas-dev:main from jorisvandenbossche:arrow-capsule-import
Commits (7, all by jorisvandenbossche):

- b63e601 ENH: add basic DataFrame.from_arrow class method for importing throug…
- 6901e6d add validation
- 6af237c add return type
- fad6bb1 add type hints and protocol definitions
- d3b8927 Merge remote-tracking branch 'upstream/main' into arrow-capsule-import
- 5cccaab update link
- fa4eb11 add whatsnew note
Diff (excerpt from pandas/core/frame.py):

```diff
@@ -215,6 +215,8 @@
     AnyAll,
     AnyArrayLike,
     ArrayLike,
+    ArrowArrayExportable,
+    ArrowStreamExportable,
     Axes,
     Axis,
     AxisInt,
@@ -1832,6 +1834,54 @@ def __rmatmul__(self, other) -> DataFrame:
     # ----------------------------------------------------------------------
     # IO methods (to / from other formats)

+    @classmethod
+    def from_arrow(
+        cls, data: ArrowArrayExportable | ArrowStreamExportable
+    ) -> DataFrame:
+        """
+        Construct a DataFrame from a tabular Arrow object.
+
+        This function accepts any Arrow-compatible tabular object implementing
+        the `Arrow PyCapsule Protocol`_ (i.e. having an ``__arrow_c_array__``
+        or ``__arrow_c_stream__`` method).
+
+        This function currently relies on ``pyarrow`` to convert the tabular
+        object in Arrow format to pandas.
+
+        .. _Arrow PyCapsule Protocol: https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html
+
+        .. versionadded:: 3.0
+
+        Parameters
+        ----------
+        data : pyarrow.Table or Arrow-compatible table
+            Any tabular object implementing the Arrow PyCapsule Protocol
+            (i.e. has an ``__arrow_c_array__`` or ``__arrow_c_stream__``
+            method).
+
+        Returns
+        -------
+        DataFrame
+        """
+        pa = import_optional_dependency("pyarrow", min_version="14.0.0")
+        if not isinstance(data, pa.Table):
+            if not (
+                hasattr(data, "__arrow_c_array__")
+                or hasattr(data, "__arrow_c_stream__")
+            ):
+                # explicitly test this, because otherwise we would accept
+                # various other input types through the pa.table(..) call
+                raise TypeError(
+                    "Expected an Arrow-compatible tabular object (i.e. having an "
+                    "'__arrow_c_array__' or '__arrow_c_stream__' method), got "
+                    f"'{type(data).__name__}' instead."
+                )
+            data = pa.table(data)
+
+        df = data.to_pandas()
+        return df
+
     @classmethod
     def from_dict(
         cls,
```
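For reference, a minimal usage sketch of the method this diff adds (it requires pyarrow >= 14.0.0 per the `import_optional_dependency` check, and `pd.DataFrame.from_arrow` assumes a pandas build that includes this PR):

```python
import pandas as pd
import pyarrow as pa

# A pyarrow Table is passed through to .to_pandas() directly ...
table = pa.table({"a": [1, 2, 3], "b": ["x", "y", "z"]})
df = pd.DataFrame.from_arrow(table)

# ... while anything else exposing __arrow_c_array__ or __arrow_c_stream__
# is first converted with pa.table(...). A RecordBatch exposes both.
df2 = pd.DataFrame.from_arrow(table.to_batches()[0])
```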
Does this actually work for things that only expose `__arrow_c_array__`?
Exposing `__arrow_c_array__` is necessary but not sufficient. Both `Array` and `RecordBatch` expose the same `__arrow_c_array__` interface. It's overloaded so that a `RecordBatch` can be interpreted the same as an `Array` of type `Struct`.
And to be fair, `RecordBatch` has both `__arrow_c_array__` and `__arrow_c_stream__` dunder methods, so just testing with a `RecordBatch` does not actually prove that `pa.table(..)` works with objects that only implement the array version. But because in the tests I wrap the record batch in a dummy object that only exposes `__arrow_c_array__`, the tests should cover this and assert that `DataFrame.from_arrow()` works with both dunder methods.
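A minimal sketch of such a dummy wrapper (the class name is hypothetical; the actual test helper may look different):

```python
import pyarrow as pa


class ArrayOnlyTable:
    """Wrap a RecordBatch, exposing *only* __arrow_c_array__."""

    def __init__(self, batch: pa.RecordBatch) -> None:
        self._batch = batch

    def __arrow_c_array__(self, requested_schema=None):
        # delegate to the wrapped batch's capsule export
        return self._batch.__arrow_c_array__(requested_schema)


wrapped = ArrayOnlyTable(pa.record_batch({"a": [1, 2]}))
assert not hasattr(wrapped, "__arrow_c_stream__")
# pa.table(wrapped) -- and hence DataFrame.from_arrow(wrapped) -- must go
# through the array protocol here, which is what the test asserts.
```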
Ah OK, that's good to know. So essentially it's up to the producer to determine if this makes sense, right?

I think there is still a consistency problem with how we as a consumer then work. A `RecordBatch` can be read through both the array and the stream interface, but a `Table` can only be read through the latter (unless it is forced to consolidate chunks and produce an `Array`).

I'm sure PyArrow has that covered well, but unless something gets clarified in the spec about how the array interface is expected to work, that might push libraries into making the (assumedly poor) decision that their streams should also produce consolidated array data.
I'd say it's up to the consumer to decide if the input makes sense. The producer just says "here's my data".

But I think the key added part is user intention. A struct array can represent either one array or a full `RecordBatch`, and we need a hint from the user for which is which. This is why I couldn't add PyCapsule Interface support to `polars.from_arrow`: it's missing the user intention of "this object is a series" versus "this object is a DataFrame".

I'm not sure I follow the rest of your comment @WillAyd. A stream never needs to concatenate data before starting the stream.
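A small illustration of that ambiguity, assuming a recent pyarrow that has `RecordBatch.to_struct_array`/`from_struct_array`:

```python
import pyarrow as pa

batch = pa.record_batch({"a": [1, 2], "b": [3.0, 4.0]})
struct_arr = batch.to_struct_array()  # same data as an Array of type struct

# Both objects export the same struct schema through __arrow_c_array__;
# the protocol itself carries no hint of "one column" vs "a table".
assert hasattr(batch, "__arrow_c_array__")
assert hasattr(struct_arr, "__arrow_c_array__")

# Round trip: the two views are interchangeable, so only the API the user
# calls (e.g. DataFrame.from_arrow) supplies the tabular intent.
assert pa.RecordBatch.from_struct_array(struct_arr).equals(batch)
```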
A theoretical example is a library that produces Arrow data and thinks it needs to implement `__arrow_c_array__` for its "Table" equivalent because it did so for its `RecordBatch` equivalent. If the Table contained multiple chunks of data, I assume it would need to combine all of the chunks to pass the data on through the `__arrow_c_array__` interface.
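A sketch of why that would be wasteful: a multi-chunk Table can be streamed chunk by chunk with zero copy, but flattening it to a single array requires consolidating first:

```python
import pyarrow as pa

# A Table built from two batches: each column has two chunks.
t = pa.Table.from_batches(
    [pa.record_batch({"a": [1, 2]}), pa.record_batch({"a": [3, 4]})]
)
assert t.column("a").num_chunks == 2

# __arrow_c_stream__ can hand the chunks over as-is (zero copy), whereas a
# single __arrow_c_array__ export would first need something like this:
consolidated = t.combine_chunks()  # copies the data into one chunk
assert consolidated.column("a").num_chunks == 1
```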
Maybe the spec should be more explicit about when to implement which interface. I think it's implicit that a `RecordBatch` can implement both, because both are zero copy, but a `Table` should only implement the stream interface, because only the stream interface is always zero copy.

I raised an issue a while ago to discuss the consumer implications, if you haven't seen it: apache/arrow#40648
Ah OK, great, thanks for sharing. I'll track that issue upstream.