
Conversation

@rabernat commented Dec 3, 2025

This adds an Array -> Bytes codec that uses the Arrow IPC protocol for serialization. Further context is available in the design doc from the Zarr Summit.

This is a step towards better Arrow interoperability. In Zarr Python, this only works with NumPy arrays whose dtypes map trivially to Arrow dtypes.

Implementation: zarr-developers/zarr-python#3613
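
For illustration, a minimal sketch of the round trip this codec implies, using pyarrow directly. The column name "data" and the exact framing are assumptions for this sketch, not the codec's normative definition:

```python
# Minimal sketch (not the zarr-python implementation): a 1D NumPy chunk
# round-tripped through the Arrow IPC stream format via pyarrow.
import numpy as np
import pyarrow as pa

chunk = np.arange(12, dtype="int32")

# Wrap the chunk in a single-column RecordBatch; the column name "data"
# is a placeholder here (see the column_name discussion below).
batch = pa.record_batch([pa.array(chunk)], names=["data"])

# Encode: write a Schema message followed by the RecordBatch.
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)
encoded = sink.getvalue().to_pybytes()  # these bytes would be the stored chunk

# Decode: read the stream back and recover the NumPy array.
with pa.ipc.open_stream(encoded) as reader:
    decoded = reader.read_all().column("data").to_numpy()
assert np.array_equal(chunk, decoded)
```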

@LDeakin (Member) commented Dec 3, 2025

Nice one, Ryan. This looks easy enough to support.

Two thoughts:

  • Rather than embedding the flattening in this codec, the reshape array-to-array codec could be used
  • Permit a 2D array input as well, so that multiple columns of the same data type could be stored (see the sketch below)
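
A hedged sketch of what the second suggestion could look like in pyarrow, mapping an (n, k) chunk to k same-typed columns; the column names here are hypothetical:

```python
# Sketch of the 2D suggestion (assumed semantics, not part of the PR):
# each column of an (n, k) chunk becomes one field of the RecordBatch.
import numpy as np
import pyarrow as pa

chunk2d = np.arange(8, dtype="float64").reshape(4, 2)
batch = pa.record_batch(
    [pa.array(np.ascontiguousarray(chunk2d[:, j])) for j in range(chunk2d.shape[1])],
    names=[f"col{j}" for j in range(chunk2d.shape[1])],  # hypothetical names
)
assert batch.num_columns == 2 and batch.num_rows == 4
```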


## Configuration parameters

- `column_name`: the name of the column used when generating an Arrow record batch from the Zarr array data. Implementations SHOULD use the name of the Zarr array here.
Review comment from a contributor:

Why should implementations use the name of the array here? An array only gets a name when it's stored, so IMO it might be better to recommend a default that can be determined purely from information available when creating array metadata.

Reply from a contributor:

Yes, deciding on the "name" of the array would depend on various heuristics tied to how it is stored. It might be simpler to eliminate this parameter altogether and always use a fixed name like `data`.

@jbms (Contributor) commented Dec 3, 2025

In general, the Parquet format seems more in line with the normal usage of Zarr (long-term persistent storage) than the Arrow IPC format, and in particular it supports various compression strategies.

While you could compose this codec with additional bytes -> bytes compression codecs, Parquet might offer some advantages, like permitting random access to individual fields/sub-fields while still supporting compression.
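
To make the comparison concrete, a hedged sketch that writes the same single-column table both ways; the compression choice and the resulting sizes are illustrative only:

```python
# Illustrative only: the same table written as compressed Parquet and as
# an (uncompressed) Arrow IPC stream, to compare the two encodings.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"data": pa.array(range(100_000), type=pa.int32())})

# Parquet: columnar file format with built-in compression.
pq_sink = pa.BufferOutputStream()
pq.write_table(table, pq_sink, compression="zstd")

# Arrow IPC stream: compression would instead come from a separate
# bytes -> bytes codec in the Zarr pipeline.
ipc_sink = pa.BufferOutputStream()
with pa.ipc.new_stream(ipc_sink, table.schema) as writer:
    writer.write_table(table)

print(pq_sink.getvalue().size, ipc_sink.getvalue().size)
```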

@rabernat (Author) commented Dec 3, 2025

> • Rather than embedding the flattening in this codec, the reshape array-to-array codec could be used

That's an interesting suggestion. My inclination is to resist it for the following reason: it introduces required coupling between codecs. Arrow Arrays MUST be 1D, full stop. What happens if we accept this suggestion and then I set up a Zarr Array without the reshape codec? It will either:

  • Fail at runtime when the Arrow IPC codec is unable to process the ND array
  • Implement some kind of dependency system between codecs to make sure they are compatible

So I'd prefer to keep flattening as part of the codec itself.
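
For concreteness, a minimal sketch of what in-codec flattening could look like; the function names, signatures, and the fixed column name are assumptions for this sketch, not the spec:

```python
# Hedged sketch of in-codec flattening: ravel on encode, restore the
# chunk shape on decode. Names here are illustrative, not normative.
import numpy as np
import pyarrow as pa

def encode(chunk: np.ndarray, column_name: str = "data") -> bytes:
    # Arrow arrays are 1D, so the ND chunk is flattened inside the codec.
    batch = pa.record_batch([pa.array(chunk.ravel(order="C"))], names=[column_name])
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, batch.schema) as writer:
        writer.write_batch(batch)
    return sink.getvalue().to_pybytes()

def decode(data: bytes, chunk_shape: tuple, column_name: str = "data") -> np.ndarray:
    # The chunk shape comes from the array metadata, not from the stream.
    with pa.ipc.open_stream(data) as reader:
        flat = reader.read_all().column(column_name).to_numpy()
    return flat.reshape(chunk_shape)

roundtrip = decode(encode(np.ones((3, 4), dtype="int64")), (3, 4))
assert roundtrip.shape == (3, 4)
```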

@jbms (Contributor) commented Dec 3, 2025

> • Rather than embedding the flattening in this codec, the reshape array-to-array codec could be used
>
> That's an interesting suggestion. My inclination is to resist it for the following reason: it introduces required coupling between codecs. Arrow Arrays MUST be 1D, full stop. What happens if we accept this suggestion and then I set up a Zarr Array without the reshape codec? It will either:

> • Fail at runtime when the Arrow IPC codec is unable to process the ND array

The mismatch in number of dimensions could be detected when creating or opening the array, depending on what sort of validation the implementation does. That would certainly be better than waiting until the first read or write operation.

> • Implement some kind of dependency system between codecs to make sure they are compatible

Yes, optionally the implementation could automatically insert a reshape codec.

> So I'd prefer to keep flattening as part of the codec itself.

Nonetheless, it seems reasonable to convert to 1D automatically, just as the bytes and packbits codecs do.

Note: there are also the Arrow Tensor and SparseTensor message types (https://github.com/apache/arrow/blob/main/format/Tensor.fbs); they seem to be independent of the Schema and RecordBatch messages. While they provide a direct Arrow representation of multi-dimensional arrays, I imagine they are not widely supported by other software, and, more importantly, they only support simple, fixed-size data types, whereas this codec is most useful for variable-length and struct data types.

@jbms (Contributor) commented Dec 3, 2025

The codec description should say explicitly that the encoded representation starts with a Schema message, if that is the intention. The Schema can in any case be determined from the array data type, but including it would presumably allow the raw chunk file to be consumed more easily by other software.
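
A hedged sketch of what that buys: if each encoded chunk begins with the standard IPC Schema message, generic Arrow tooling can open a raw chunk file directly. The chunk path below is hypothetical:

```python
# Assumes the encoded chunk is a standard Arrow IPC stream beginning with
# a Schema message; the chunk path "c/0/0" is hypothetical.
import pyarrow as pa

with open("c/0/0", "rb") as f:
    with pa.ipc.open_stream(f.read()) as reader:
        print(reader.schema)       # recovered from the leading Schema message
        table = reader.read_all()  # the chunk data, no Zarr metadata needed
```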
