
Add support for scalar values with extension types #1301

@timsaucer


Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Suppose I have a pyarrow scalar value that contains an extension type. If I turn that into a literal expression in datafusion, the associated extension metadata should be carried along transparently, with no extra work from the user.

Consider this minimal example:

import pyarrow as pa
import uuid
from datafusion import lit

value = pa.scalar(uuid.uuid4().bytes, pa.uuid())

print(lit(value))

This currently fails with ArrowTypeError: Expected bytes, got a 'UUID' object. That can be overcome with this simple patch:

--- a/src/pyarrow_util.rs
+++ b/src/pyarrow_util.rs
@@ -30,7 +30,11 @@ impl FromPyArrow for PyScalarValue {
     fn from_pyarrow_bound(value: &Bound<'_, PyAny>) -> PyResult<Self> {
         let py = value.py();
         let typ = value.getattr("type")?;
-        let val = value.call_method0("as_py")?;
+        let val = if value.hasattr("value")? {
+            value.getattr("value")?
+        } else {
+            value.call_method0("as_py")?
+        };

But then we still don't have the metadata. It is lost, and we get a bare fixed-size binary.

Describe the solution you'd like

The above code should just work. I have done a little investigation, and using the Arrow PyCapsule interface we can get the schema of the array we generate inside PyScalarValue::from_pyarrow_bound. We can then plumb this through when calling lit().

Ideally we would take this opportunity to ensure that PyScalarValue::from_pyarrow_bound also supports libraries other than pyarrow. It has come up a few times that we are too tightly coupled to pyarrow. In particular, it would be good to demonstrate that converting a Python object that is a scalar value works for:

  • pyarrow
  • nanoarrow
  • arro3
  • polars

I don't think we necessarily need to support pandas, since it is not an Arrow library.
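
One way to picture library-agnostic handling is to dispatch on the dunder methods of the Arrow PyCapsule interface rather than on pyarrow classes. The sketch below is plain Python for illustration only: pick_import_path and both stand-in classes are hypothetical, and it makes no claim about which of these methods the scalar objects of each listed library actually implement.

```python
from typing import Any


def pick_import_path(obj: Any) -> str:
    """Return which import route a (hypothetical) converter would take."""
    # Arrow PyCapsule interface: array-like objects export a schema capsule
    # and an array capsule through __arrow_c_array__.
    if hasattr(obj, "__arrow_c_array__"):
        return "pycapsule_array"
    # Stream-like objects export through __arrow_c_stream__.
    if hasattr(obj, "__arrow_c_stream__"):
        return "pycapsule_stream"
    # pyarrow-style scalars expose `type` and `as_py()`, which is what the
    # existing conversion code reads.
    if hasattr(obj, "type") and hasattr(obj, "as_py"):
        return "pyarrow_scalar"
    return "unsupported"


class FakeCapsuleScalar:
    """Stand-in for a scalar from a library that implements the protocol."""

    def __arrow_c_array__(self, requested_schema=None):
        raise NotImplementedError("stand-in only; real objects return two PyCapsules")


class FakePyArrowScalar:
    """Stand-in mimicking only the attributes the current code path reads."""

    type = "fixed_size_binary[16]"

    def as_py(self):
        return b"\x00" * 16


print(pick_import_path(FakeCapsuleScalar()))
print(pick_import_path(FakePyArrowScalar()))
print(pick_import_path(object()))
```

Checking the protocol methods first means the pyarrow-specific path becomes a fallback instead of the only route, which is the decoupling this issue asks for.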

Describe alternatives you've considered

Alternatively the user can manually turn their data into the underlying storage and then attach the metadata from their extension type. This feels like a poor user experience.
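
To make that manual route concrete, here is a rough sketch of the two pieces the user would have to produce by hand for a UUID: the storage bytes and the extension field metadata. split_uuid is a hypothetical helper, not part of datafusion or pyarrow. The key names are the ones the Arrow format reserves for extension types, and the empty serialized metadata is an assumption.

```python
import uuid

# Field-metadata keys that the Arrow columnar format reserves for extension types.
EXT_NAME_KEY = "ARROW:extension:name"
EXT_METADATA_KEY = "ARROW:extension:metadata"


def split_uuid(value: uuid.UUID) -> tuple[bytes, dict[str, str]]:
    """Hypothetical helper: split a UUID into storage bytes and field metadata."""
    storage = value.bytes  # 16 raw bytes, matching a fixed-size binary of width 16
    metadata = {
        EXT_NAME_KEY: "arrow.uuid",  # canonical extension name for UUIDs
        EXT_METADATA_KEY: "",  # assumption: this type has no parameters to serialize
    }
    return storage, metadata


storage, metadata = split_uuid(uuid.uuid4())
print(len(storage), metadata[EXT_NAME_KEY])
```

Every caller would need to repeat this bookkeeping and keep the two pieces together, which is why it feels like a poor experience compared with lit(value) just working.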

Additional context

This came up during a different investigation:

Also worth evaluating while we're doing this: For scalar values, is it possible for them to contain metadata? If I do pa.scalar(uuid.uuid4().bytes, type=pa.uuid()) and I check the type I should have the extension data. Maybe this is already supported, but as part of this PR I want to evaluate that as well.

Originally posted by @timsaucer in #1299 (comment)

Labels: enhancement (New feature or request)