Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Suppose I have a pyarrow scalar value that contains an extension type. If I turn it into a literal expression in datafusion, the associated extension metadata should be preserved transparently for the user.
Consider this minimal example:
import pyarrow as pa
import uuid
from datafusion import lit
value = pa.scalar(uuid.uuid4().bytes, pa.uuid())
print(lit(value))
This currently fails with ArrowTypeError: Expected bytes, got a 'UUID' object. That can be overcome with the simple patch
--- a/src/pyarrow_util.rs
+++ b/src/pyarrow_util.rs
@@ -30,7 +30,11 @@ impl FromPyArrow for PyScalarValue {
     fn from_pyarrow_bound(value: &Bound<'_, PyAny>) -> PyResult<Self> {
         let py = value.py();
         let typ = value.getattr("type")?;
-        let val = value.call_method0("as_py")?;
+        let val = if value.hasattr("value")? {
+            value.getattr("value")?
+        } else {
+            value.call_method0("as_py")?
+        };
But then we still don't have the metadata. It is lost and we get a bare fixed-size binary.
Describe the solution you'd like
The above code should just work. I have done a little investigation and using the pycapsule interface we can get the schema of the array we generate inside PyScalarValue::from_pyarrow_bound. We can then plumb this through when calling lit().
Ideally we would take this opportunity to ensure that PyScalarValue::from_pyarrow_bound also supports libraries other than pyarrow. There have been a few complaints that we are too tightly coupled to pyarrow. In particular, it would be good to demonstrate that converting a Python object that is a scalar value works for:
- pyarrow
- nanoarrow
- arro3
- polars
I don't think we necessarily need to support pandas since it is not an Arrow library.
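One library-agnostic way to approach the list above is to check for the Arrow PyCapsule Interface methods instead of checking for pyarrow types. The following is a minimal sketch using only the standard library. The helper `arrow_capsule_support` and the class `FakeArrayLike` are hypothetical names introduced for illustration, and whether each library's scalar objects actually expose these methods would need to be verified per library.

```python
from typing import Any, List

# Dunder methods defined by the Arrow PyCapsule Interface.
CAPSULE_METHODS = ("__arrow_c_schema__", "__arrow_c_array__", "__arrow_c_stream__")


def arrow_capsule_support(obj: Any) -> List[str]:
    """Return the PyCapsule protocol methods that `obj` exposes (hypothetical helper)."""
    return [name for name in CAPSULE_METHODS if callable(getattr(obj, name, None))]


class FakeArrayLike:
    """Hypothetical stand-in for an object from any Arrow library."""

    def __arrow_c_array__(self, requested_schema=None):
        # A real implementation would return a (schema capsule, array capsule) pair.
        raise NotImplementedError("illustration only")


print(arrow_capsule_support(FakeArrayLike()))
print(arrow_capsule_support(object()))
```

Dispatching on these methods rather than on `isinstance` checks is what would let the same conversion path accept objects from several libraries.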
Describe alternatives you've considered
Alternatively the user can manually turn their data into the underlying storage and then attach the metadata from their extension type. This feels like a poor user experience.
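To make that burden concrete, here is a rough sketch of what the manual route asks of the user for a UUID: produce the 16-byte storage value and the field metadata that marks it as the canonical `arrow.uuid` extension. The helper name `uuid_to_storage_and_metadata` is hypothetical, the empty serialized-metadata string is an assumption, and how the metadata would then be attached to a datafusion literal is not shown because the issue does not specify it.

```python
import uuid
from typing import Dict, Tuple

# Field-metadata keys that the Arrow format reserves for extension types.
EXTENSION_NAME_KEY = "ARROW:extension:name"
EXTENSION_METADATA_KEY = "ARROW:extension:metadata"


def uuid_to_storage_and_metadata(value: uuid.UUID) -> Tuple[bytes, Dict[str, str]]:
    """Split a UUID into its storage bytes and extension metadata (hypothetical helper)."""
    storage = value.bytes  # 16 raw bytes, matching a fixed-size binary of width 16
    metadata = {
        EXTENSION_NAME_KEY: "arrow.uuid",
        # Assumption: the UUID extension has no parameters, so this is left empty.
        EXTENSION_METADATA_KEY: "",
    }
    return storage, metadata


storage, metadata = uuid_to_storage_and_metadata(uuid.uuid4())
print(len(storage), metadata)
```

Every user of an extension type would have to repeat this bookkeeping by hand, which is the poor experience described above.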
Additional context
This came up during a different investigation:
Also worth evaluating while we're doing this: For scalar values, is it possible for them to contain metadata? If I do
pa.scalar(uuid.uuid4().bytes, type=pa.uuid())
and I check the type I should have the extension data. Maybe this is already supported, but as part of this PR I want to evaluate that as well.
Originally posted by @timsaucer in #1299 (comment)