@ollemartensson ollemartensson commented Aug 30, 2025

Motivation

The C Data Interface is the formal, ABI-stable contract that allows different language runtimes to exchange complex datasets with zero serialization or memory copying overhead.
The Tensor and Sparse Tensor support aligns Julia with the needs of machine learning and artificial intelligence communities, where multi-dimensional arrays are the fundamental data structure.

Features

Apache Arrow C Data Interface

Core C Data Interface

  • ArrowSchema and ArrowArray C-compatible struct definitions
  • FFI bindings with proper ABI compatibility
  • Format string protocol implementation (primitive, temporal, nested types)
  • Memory management with release callbacks
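For readers unfamiliar with the two structs, the sketch below mirrors the field order of `ArrowSchema` and `ArrowArray` from the Arrow C Data Interface specification. It uses Python's `ctypes` purely for illustration and is not the Julia code in this PR; the names `SchemaRelease` and `ArrayRelease` are made up for the example.

```python
import ctypes


class ArrowSchema(ctypes.Structure):
    pass


class ArrowArray(ctypes.Structure):
    pass


# Hypothetical names for the release callback signatures:
# void (*release)(struct ArrowSchema*) and void (*release)(struct ArrowArray*)
SchemaRelease = ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowSchema))
ArrayRelease = ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowArray))

# Field order follows the C declaration; any reordering breaks the ABI.
ArrowSchema._fields_ = [
    ("format", ctypes.c_char_p),       # null-terminated format string
    ("name", ctypes.c_char_p),
    ("metadata", ctypes.c_void_p),     # binary-encoded, so not a C string
    ("flags", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowSchema))),
    ("dictionary", ctypes.POINTER(ArrowSchema)),
    ("release", SchemaRelease),
    ("private_data", ctypes.c_void_p),
]

ArrowArray._fields_ = [
    ("length", ctypes.c_int64),
    ("null_count", ctypes.c_int64),
    ("offset", ctypes.c_int64),
    ("n_buffers", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("buffers", ctypes.POINTER(ctypes.c_void_p)),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowArray))),
    ("dictionary", ctypes.POINTER(ArrowArray)),
    ("release", ArrayRelease),
    ("private_data", ctypes.c_void_p),
]


def layout(struct_type):
    """Return (field name, byte offset) pairs for a quick ABI sanity check."""
    return [(name, getattr(struct_type, name).offset)
            for name, _ in struct_type._fields_]
```

Comparing `layout(ArrowSchema)` and `ctypes.sizeof(ArrowSchema)` against the values a C compiler reports for the same declaration is one simple way to catch layout mistakes early.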

Export Functionality

  • Zero-copy export from Julia ArrowVector to C pointers
  • Guardian object pattern for GC-safe memory management
  • @cfunction integration for release callbacks
  • Recursive export support for nested types

Import Functionality

  • Zero-copy import from C pointers to Julia ArrowVector
  • Finalizer-based foreign memory management
  • Format string parsing and type reconstruction
  • Cross-language validation framework

Dense Tensor Support

Dense Tensor Types

  • DenseTensor{T,N} with full AbstractArray interface
  • Zero-copy wrapper around Arrow FixedSizeList
  • Multi-dimensional indexing with row-major storage
  • Support for dimension names and permutations
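To make the indexing rule concrete, here is a small Python sketch (illustration only, zero-based, not the Julia implementation) of mapping a logical index to an offset in row-major storage. It assumes that logical dimension `i` corresponds to physical dimension `permutation[i]`, which is my reading of the `arrow.fixed_shape_tensor` description; the function names are hypothetical.

```python
def row_major_strides(shape):
    """Element strides for C-contiguous (row-major) storage."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides


def linear_index(logical_index, shape, permutation=None):
    """Map a zero-based logical index to an offset into the flat buffer.

    `shape` is the physical (stored) shape. Assumption: logical dimension i
    maps to physical dimension permutation[i].
    """
    ndim = len(shape)
    if permutation is None:
        permutation = list(range(ndim))
    if sorted(permutation) != list(range(ndim)):
        raise ValueError("permutation must reorder 0..ndim-1")
    if len(logical_index) != ndim:
        raise ValueError("index rank does not match tensor rank")

    # Undo the permutation to recover the physical index.
    physical = [0] * ndim
    for i, p in enumerate(permutation):
        physical[p] = logical_index[i]

    for idx, extent in zip(physical, shape):
        if not 0 <= idx < extent:
            raise IndexError("index out of bounds")

    return sum(i * s for i, s in zip(physical, row_major_strides(shape)))
```

For a stored shape of `[2, 3]`, `linear_index((1, 2), [2, 3])` and `linear_index((2, 1), [2, 3], permutation=[1, 0])` address the same element. A Julia wrapper additionally has to account for 1-based indexing and Julia's column-major default.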

Arrow Integration

  • arrow.fixed_shape_tensor canonical extension type
  • JSON metadata serialization (shape, dim_names, permutation)
  • ArrowTypes.jl interface for automatic serialization
  • Round-trip serialization/deserialization
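As an illustration of the metadata involved, the following Python sketch encodes and decodes the three keys listed above (`shape`, `dim_names`, `permutation`) as a JSON string. The helper names and validation rules are assumptions made for the example, not the PR's API.

```python
import json


def encode_tensor_metadata(shape, dim_names=None, permutation=None):
    """Build the JSON metadata string for a fixed-shape tensor (sketch)."""
    shape = list(shape)
    if any(d < 0 for d in shape):
        raise ValueError("shape entries must be non-negative")
    meta = {"shape": shape}
    if dim_names is not None:
        if len(dim_names) != len(shape):
            raise ValueError("dim_names must have one entry per dimension")
        meta["dim_names"] = list(dim_names)
    if permutation is not None:
        if sorted(permutation) != list(range(len(shape))):
            raise ValueError("permutation must reorder 0..ndim-1")
        meta["permutation"] = list(permutation)
    return json.dumps(meta, separators=(",", ":"))


def decode_tensor_metadata(text):
    """Parse the metadata string back into (shape, dim_names, permutation)."""
    meta = json.loads(text)
    return meta["shape"], meta.get("dim_names"), meta.get("permutation")
```

A round trip such as `decode_tensor_metadata(encode_tensor_metadata([100, 200, 500], ["C", "H", "W"], [2, 0, 1]))` should return the original three values, with the optional keys coming back as `None` when they were omitted.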

Tensor Operations

  • Linear indexing with permutation support
  • Element access and modification
  • Conversion from Julia AbstractArrays
  • Display and pretty-printing

Sparse Tensor Support

Sparse Tensor Type Hierarchy

  • AbstractSparseTensor{T,N} abstract base type
  • Full AbstractArray interface implementation
  • nnz() function for non-zero element counting

COO (Coordinate) Format

  • SparseTensorCOO{T,N} for general sparse tensors
  • Explicit coordinate and value storage
  • Support for any dimensionality (1D, 2D, 3D, N-D)
  • Conversion from Julia SparseMatrixCSC

CSX (Compressed Sparse Row/Column) Format

  • SparseTensorCSX{T} for 2D sparse matrices
  • CSR and CSC compression variants
  • Index pointer optimization for sparse linear algebra
  • Direct integration with Julia SparseMatrixCSC
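The relationship between the COO and CSR layouts can be sketched in a few lines. The Python below is a generic textbook conversion with hypothetical function names, shown for illustration only; it is not taken from the PR and uses zero-based indices.

```python
def coo_to_csr(n_rows, rows, cols, vals):
    """Convert COO triplets to CSR arrays (indptr, indices, data).

    Entries keep their input order within each row; duplicates are not merged.
    """
    # Count entries per row, then prefix-sum into row pointers.
    indptr = [0] * (n_rows + 1)
    for r in rows:
        indptr[r + 1] += 1
    for i in range(n_rows):
        indptr[i + 1] += indptr[i]

    # Scatter each entry into the next free slot of its row.
    next_slot = indptr[:-1]  # slicing copies, so indptr is left untouched
    indices = [0] * len(vals)
    data = [0] * len(vals)
    for r, c, v in zip(rows, cols, vals):
        k = next_slot[r]
        indices[k] = c
        data[k] = v
        next_slot[r] += 1
    return indptr, indices, data


def csr_get(indptr, indices, data, row, col, default=0):
    """Look up one element by scanning the stored entries of a row."""
    for k in range(indptr[row], indptr[row + 1]):
        if indices[k] == col:
            return data[k]
    return default
```

The index pointer array is what makes row slicing cheap: the entries of row `r` are exactly positions `indptr[r]` up to (but excluding) `indptr[r + 1]`. CSC is the same construction with the roles of rows and columns swapped.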

CSF (Compressed Sparse Fiber) Format

  • SparseTensorCSF{T,N} for advanced N-dimensional compression
  • Hierarchical tree-like compression structure
  • Foundation for high-performance sparse tensor operations

Sparse Tensor Serialization

  • arrow.sparse_tensor extension type with format metadata
  • JSON metadata encoding (format_type, shape, nnz, compression_axis)
  • Format-specific buffer layouts (COO, CSR/CSC, CSF)
  • Round-trip serialization preserving all properties

Julia Ecosystem Integration

  • Native SparseArrays.jl compatibility
  • Automatic conversion utilities
  • Memory-efficient storage with high compression ratios
  • Cross-language interoperability via Arrow standard

Engineering Challenges and Mitigation

Memory Safety at the GC/FFI Boundary

Description

The primary risk is the impedance mismatch between Julia's automatic garbage collection and the C Data Interface's manual release callback mechanism. Failure to correctly manage this boundary can lead to use-after-free errors or memory leaks.

Mitigation

Export uses @cfunction release callbacks together with a guardian object to prevent premature garbage collection of exported memory. Import requires the correct use of finalizers to ensure the producer's release callback is always called. The memory management patterns in arrow-rs (using Box::into_raw and ManuallyDrop) and pyarrow serve as inspiration.

ABI and Format String Correctness

Description

The C Data Interface is an ABI specification. Any deviation in struct layout or incorrect generation/parsing of the format string will lead to data corruption or crashes when communicating with other Arrow libraries.

Mitigation

The implementation of the Julia structs must precisely match the C specification. A large test suite has been created to validate the format string generation and parsing logic for every supported Arrow data type, including all primitive, temporal, and nested variations defined in the specification.
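For context, a format string parser has roughly the following shape. This Python sketch covers a subset of the format strings defined by the Arrow C Data Interface (for example `i` for int32, `tsu:UTC` for a microsecond timestamp, `+w:N` for a fixed-size list); the function name and the returned tuple representation are invented for the example and do not reflect the PR's types.

```python
PRIMITIVE = {
    "n": "null", "b": "bool",
    "c": "int8", "C": "uint8", "s": "int16", "S": "uint16",
    "i": "int32", "I": "uint32", "l": "int64", "L": "uint64",
    "e": "float16", "f": "float32", "g": "float64",
    "z": "binary", "Z": "large_binary", "u": "utf8", "U": "large_utf8",
}
TEMPORAL = {
    "tdD": ("date32", "days"), "tdm": ("date64", "ms"),
    "tts": ("time32", "s"), "ttm": ("time32", "ms"),
    "ttu": ("time64", "us"), "ttn": ("time64", "ns"),
    "tDs": ("duration", "s"), "tDm": ("duration", "ms"),
    "tDu": ("duration", "us"), "tDn": ("duration", "ns"),
}
NESTED = {"+l": "list", "+L": "large_list", "+s": "struct", "+m": "map"}
TIMESTAMP_UNITS = {"s": "s", "m": "ms", "u": "us", "n": "ns"}


def parse_format(fmt):
    """Parse a format string into a tuple describing the type (sketch)."""
    if fmt in PRIMITIVE:
        return (PRIMITIVE[fmt],)
    if fmt in TEMPORAL:
        return TEMPORAL[fmt]
    if fmt in NESTED:
        return (NESTED[fmt],)
    # Timestamps: "ts" + unit letter + ":" + optional timezone
    if fmt.startswith("ts") and len(fmt) >= 4 and fmt[3] == ":" \
            and fmt[2] in TIMESTAMP_UNITS:
        return ("timestamp", TIMESTAMP_UNITS[fmt[2]], fmt[4:])
    if fmt.startswith("+w:"):
        return ("fixed_size_list", int(fmt[3:]))
    if fmt.startswith("w:"):
        return ("fixed_size_binary", int(fmt[2:]))
    if fmt.startswith("d:"):
        parts = [int(p) for p in fmt[2:].split(",")]
        if len(parts) == 2:
            parts.append(128)  # bit width defaults to 128 when omitted
        if len(parts) != 3:
            raise ValueError(f"malformed decimal format: {fmt!r}")
        return ("decimal", parts[0], parts[1], parts[2])
    raise ValueError(f"unsupported format string: {fmt!r}")
```

A table-driven test that feeds every supported string through a generator and back through a parser like this one is a straightforward way to check round-trip correctness per type.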

Complexity of the CSF Sparse Format

Description

The Compressed Sparse Fiber format is significantly more complex than the other sparse formats due to its recursive, hierarchical structure.

Mitigation

The implementation is heavily guided by the formal FlatBuffers specification (SparseTensor.fbs) and by studying the existing, mature implementations in the Arrow C++ and Rust libraries.

AI generated code

Description

Although I have a career's worth of coding experience, the code is mostly generated using Claude.

Mitigation

I designed the architecture upfront and provided a plan with granular per-phase and per-step prompts to mitigate context rot and drift.

kou (Member) commented Aug 31, 2025

Could you split this into smaller PRs so that it is easier to review?
For example, we don't need to mix C Data Interface support and tensor support, right?

> Although I have a career's worth of coding experience, the code is mostly generated using Claude.

Can we safely license the code under the Apache License 2.0 with ASF copyright?
(Can we ensure that the code doesn't include any copyrighted code?)

ollemartensson (Author) commented

> Could you split this into smaller PRs so that it is easier to review?
>
> For example, we don't need to mix C Data Interface support and tensor support, right?

Yes, definitely. The work is already divided into three separate commits, so I can split this PR into three distinct PRs instead.

> Although I have a career's worth of coding experience, the code is mostly generated using Claude.
>
> Can we safely license the code under the Apache License 2.0 with ASF copyright?
>
> (Can we ensure that the code doesn't include any copyrighted code?)

A valid concern, of course. The truth is that there is probably no way to be certain, since these models have been trained on a mix of copyrighted code.
However, the same goes for human-written code as well, I suppose.

For this particular case I can be fairly sure where the inspiration comes from, since the implementation instructions were created by analysing the existing code bases mentioned in the PR, and no large architecture patterns are applied. In other words, the AI cannot come up with these types of solutions on its own (yet) and needs to be boxed in by instructions.

It's a very important and interesting topic and, honestly, I have no assurance to give.

What I can do (and am willing to do) is put in more work analysing each part of the contributed code for potential violations and document the process.

So let me come back with three separate PRs and go over each of them, searching for and cleaning up potential license violations.

On the positive side, my initial tests exceeded my expectations performance-wise.

ollemartensson (Author) commented

@kou I have now created three distinct PRs:

  1. C Data Interface - Zero-copy interoperability (37 tests)
  2. Dense Tensor Support - arrow.fixed_shape_tensor extension (61 tests)
  3. Sparse Tensor Support - COO, CSR/CSC, CSF formats (113 tests)

Each can be reviewed/merged separately.

Regarding Licensing and AI-Generated Code

What has been done for assurance/mitigation:

  1. Research-Based Foundation: The architectures and approaches come from my analysis of Apache Arrow specifications and existing implementations, not from AI creativity. The AI served as an implementation tool following my technical direction.

  2. Standard Algorithms: The code implements well-established patterns:

    • C Data Interface follows the official Arrow ABI specification exactly
    • Dense tensors implement the canonical arrow.fixed_shape_tensor extension
    • Sparse formats use standard COO, CSR/CSC, CSF algorithms from academic literature

  3. Transparent Process: All commits include clear AI assistance attribution, establishing the research → AI implementation → review workflow.

kou (Member) commented Sep 1, 2025

I found the ASF Generative Tooling Guidance: https://www.apache.org/legal/generative-tooling.html

Could you follow the guidance?

kou (Member) commented Sep 1, 2025

Could you open issues for dense tensors and sparse tensors?
Could you also add "Fix #XXX" references to the created issues in #562 and #563?

ollemartensson (Author) commented

> I found the ASF Generative Tooling Guidance: https://www.apache.org/legal/generative-tooling.html
>
> Could you follow the guidance?

Thank you for the link. I made slight adjustments and replied in the issue with the assessment: #184 (comment)

ollemartensson (Author) commented

> Could you open issues for dense tensors and sparse tensors? Could you also add "Fix #XXX" references to the created issues in #562 and #563?

Yes, absolutely. I'll create the issues and link them properly with the PRs.
