C Data Interface and Tensor Support #560
Could you split this into smaller PRs to make it easier to review?
Can we safely license the code under the Apache License 2.0 with ASF copyright?
Yes definitely, the work is already divided into three separate commits so I can split this PR into three distinct PRs instead.
A valid concern, of course. The truth is that there is probably no way to be certain, since these models have been trained on a mix of copyrighted code. For this particular case I can be quite sure where the inspiration comes from, since the implementation instructions were created by analysing the existing code bases mentioned in the PR, and no big architecture patterns are applied. In other words, the AI cannot just come up with these types of solutions (yet) and needs to be boxed in by instructions. It's a very important and interesting topic and, honestly, I have no assurance to give. What I can do (and am willing to do) is put in more work analysing each part of the contributed code for potential violations and document the process. So let me come back with three separate PRs and go over each of them, searching for and cleaning up potential license violations. On the positive side, my initial tests exceeded my expectations performance-wise.
@kou I have now created three distinct PRs:
Each can be reviewed/merged separately.
Regarding Licensing and AI-Generated Code
What has been done for assurance/mitigation:
I found the ASF Generative Tooling Guidance: https://www.apache.org/legal/generative-tooling.html Could you follow the guidance?
Thank you for the link. I made slight adjustments and I replied in the issue with the assessment: #184 (comment)
Motivation
The C Data Interface is the formal, ABI-stable contract that allows different language runtimes to exchange complex datasets with zero serialization or memory copying overhead.
The Tensor and Sparse Tensor support aligns Julia with the needs of machine learning and artificial intelligence communities, where multi-dimensional arrays are the fundamental data structure.
Features
Apache Arrow C Data Interface
Core C Data Interface
types)
Export Functionality
Import Functionality
Dense Tensor Support
Dense Tensor Types
Arrow Integration
Tensor Operations
Sparse Tensor Support
Sparse Tensor Type Hierarchy
COO (Coordinate) Format
CSX (Compressed Sparse Row/Column) Format
CSF (Compressed Sparse Fiber) Format
Sparse Tensor Serialization
Julia Ecosystem Integration
Engineering Challenges and Mitigation
Memory Safety at the GC/FFI Boundary
Description
The primary risk is the impedance mismatch between Julia's automatic garbage collection and the C Data Interface's manual release callback mechanism. Failure to correctly manage this boundary can lead to use-after-free errors or memory leaks.
Mitigation
On the export side, the implementation uses @cfunction to create the release callback and a guardian object that keeps the exported data alive, preventing premature garbage collection. On the import side, finalizers must be used correctly to ensure the producer's release callback is always called. The memory management patterns in arrow-rs (using Box::into_raw and ManuallyDrop) and pyarrow serve as inspiration.
ABI and Format String Correctness
Description
The C Data Interface is an ABI specification. Any deviation in struct layout or incorrect generation/parsing of the format string will lead to data corruption or crashes when communicating with other Arrow libraries.
Mitigation
The Julia struct definitions must precisely match the C specification. A large test suite has been created to validate the format string generation and parsing logic for every supported Arrow data type, including all primitive, temporal, and nested variations defined in the specification.
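As an illustration of what format string parsing involves, here is a small Python sketch. It is not the parser from this PR and covers only a subset of the format strings defined by the C Data Interface specification; `parse_format` and the tuple shapes it returns are hypothetical choices made for this example.

```python
# Subset of the format strings from the Arrow C Data Interface specification.
_SIMPLE = {
    "n": "null", "b": "bool",
    "c": "int8", "C": "uint8", "s": "int16", "S": "uint16",
    "i": "int32", "I": "uint32", "l": "int64", "L": "uint64",
    "e": "float16", "f": "float32", "g": "float64",
    "z": "binary", "Z": "large_binary", "u": "utf8", "U": "large_utf8",
    "tdD": "date32", "tdm": "date64",
    "+l": "list", "+L": "large_list", "+s": "struct", "+m": "map",
}
_TIME_UNITS = {"s": "second", "m": "millisecond", "u": "microsecond", "n": "nanosecond"}


def parse_format(fmt):
    """Parse an Arrow format string into a (hypothetical) descriptive tuple."""
    if fmt in _SIMPLE:
        return (_SIMPLE[fmt],)
    if fmt.startswith("d:"):
        # "d:precision,scale" or "d:precision,scale,bitwidth"
        parts = [int(p) for p in fmt[2:].split(",")]
        if len(parts) == 2:
            parts.append(128)  # bit width defaults to 128 when omitted
        if len(parts) != 3:
            raise ValueError(f"malformed decimal format: {fmt!r}")
        return ("decimal", *parts)
    if fmt.startswith("w:"):
        return ("fixed_size_binary", int(fmt[2:]))
    if fmt.startswith("+w:"):
        return ("fixed_size_list", int(fmt[3:]))
    if fmt.startswith("ts") and len(fmt) >= 4 and fmt[2] in _TIME_UNITS and fmt[3] == ":":
        # "tsu:UTC" -> microsecond timestamp with timezone "UTC";
        # an empty timezone is reported as None.
        return ("timestamp", _TIME_UNITS[fmt[2]], fmt[4:] or None)
    raise ValueError(f"unsupported or malformed format string: {fmt!r}")
```

For example, `parse_format("d:19,10")` yields `("decimal", 19, 10, 128)` and `parse_format("tsu:UTC")` yields `("timestamp", "microsecond", "UTC")`. A test suite along the lines described above would pair each such case with the reverse direction (type to format string) and with deliberately malformed inputs.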
Complexity of the CSF Sparse Format
Description
The Compressed Sparse Fiber format is significantly more complex than the other sparse formats due to its recursive, hierarchical structure.
Mitigation
The implementation is heavily guided by the formal FlatBuffers specification (SparseTensor.fbs) and by studying the existing, mature implementations in the Arrow C++ and Rust libraries.
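To show where the recursion comes from, here is a simplified Python sketch of the CSF index structure. It is not the implementation in this PR: `coo_to_csf` and `csf_to_coo` are hypothetical helper names, the builder is deliberately quadratic for readability, and it ignores axis ordering and the values buffer (values would simply follow leaf order). Each level stores the distinct coordinates at that depth of a prefix tree, and `indptr[d]` gives, for each node at level `d`, the range of its children at level `d + 1`.

```python
def coo_to_csf(coords):
    """Build CSF (indptr, indices) buffers from a list of coordinate tuples."""
    coords = sorted(set(map(tuple, coords)))
    ndim = len(coords[0])
    indptr, indices = [], []
    prev_prefixes = None
    for d in range(ndim):
        # Distinct prefixes of length d + 1 are the tree nodes at level d.
        prefixes = []
        for c in coords:
            p = c[: d + 1]
            if not prefixes or prefixes[-1] != p:
                prefixes.append(p)
        indices.append([p[-1] for p in prefixes])
        if d > 0:
            # Cumulative child counts per parent node (quadratic, for clarity).
            ptr = [0]
            for parent in prev_prefixes:
                n_children = sum(1 for p in prefixes if p[:d] == parent)
                ptr.append(ptr[-1] + n_children)
            indptr.append(ptr)
        prev_prefixes = prefixes
    return indptr, indices


def csf_to_coo(indptr, indices):
    """Walk the prefix tree recursively and recover the coordinate tuples."""
    out = []

    def walk(level, node, prefix):
        prefix = prefix + (indices[level][node],)
        if level == len(indices) - 1:
            out.append(prefix)  # reached a leaf: one non-zero entry
            return
        for child in range(indptr[level][node], indptr[level][node + 1]):
            walk(level + 1, child, prefix)

    for root in range(len(indices[0])):
        walk(0, root, ())
    return out
```

For a 4-D tensor with non-zeros at `(0,0,0,1)`, `(0,0,0,2)`, `(0,1,0,0)`, `(0,1,0,2)`, `(0,1,1,0)`, `(1,1,1,0)`, `(1,1,1,1)` and `(1,1,1,2)`, this produces `indptr = [[0, 2, 3], [0, 1, 3, 4], [0, 2, 4, 5, 8]]` and `indices = [[0, 1], [0, 1, 1], [0, 0, 1, 1], [1, 2, 0, 2, 0, 0, 1, 2]]`, and `csf_to_coo` round-trips back to the sorted coordinates.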
AI generated code
Description
Although I have a career's worth of coding experience, the code is mostly generated using Claude.
Mitigation
I designed and architected the solution up front and provided a plan with granular phase and step prompts to mitigate context rot and drift.