Clean support for Non-Nvidia hardware vendors like Apple Silicon #63
Conversation
Pull Request Overview
This PR adds support for non-Nvidia hardware (specifically Apple Silicon) by introducing a scheduler detection system that adapts execution strategies based on the underlying hardware. The changes enable the application to select between Flash Attention (optimized for Nvidia GPUs) and parallel head processing (for non-Nvidia hardware like Apple Silicon), along with conditional normalization steps.
Key changes:
- Introduced a SchedulerType parameter across FFN layer constructors and logits layers to enable hardware-specific code paths (see the sketch after this list)
- Added conditional execution logic for attention mechanisms and normalization based on detected hardware
- Updated TornadoVM submodule reference to support non-Nvidia backends
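The overview describes a detect-then-dispatch flow: determine what hardware the scheduler is running on, then pick the attention strategy accordingly. Below is a minimal, self-contained sketch of that idea only; the enum values, the vendor-string heuristic, and the class name are assumptions, and the PR's actual SchedulerDetectionService and SchedulerType will differ in detail.

```java
// Illustrative sketch only: the detection heuristic and enum values are assumed,
// not taken from the PR. The real SchedulerDetectionService presumably queries the
// TornadoVM backend/driver rather than a system property.
public class SchedulerDispatchSketch {

    enum SchedulerType { NVIDIA, NON_NVIDIA }

    static SchedulerType detect() {
        String vendor = System.getProperty("gpu.vendor", "apple").toLowerCase();
        return vendor.contains("nvidia") ? SchedulerType.NVIDIA : SchedulerType.NON_NVIDIA;
    }

    // Strategy selection as described in the overview: Flash Attention on Nvidia,
    // parallel head processing on everything else (e.g. Apple Silicon).
    static String attentionTaskFor(SchedulerType type) {
        return type == SchedulerType.NVIDIA ? "flash-attention" : "parallel-attention";
    }

    public static void main(String[] args) {
        SchedulerType type = detect();
        System.out.println("Scheduler: " + type + " -> attention task: " + attentionTaskFor(type));
    }
}
```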
Reviewed Changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| AbstractFFNLayers.java | Added schedulerType field and shouldUseFinalNormalization() helper method to support hardware-specific optimizations (sketched below the table) |
| LlamaQ8_0FFNLayers.java | Integrated scheduler-based attention configuration and conditional normalization tasks |
| LlamaFP16FFNLayers.java | Integrated scheduler-based attention configuration and conditional normalization tasks |
| Qwen3Q8_0FFNLayers.java, Qwen2Q8_0FFNLayers.java, Phi3Q8_0FFNLayers.java | Updated constructors to accept and pass schedulerType parameter |
| Qwen3FP16FFNLayers.java, Qwen2FP16FFNLayers.java, Phi3FP16FFNLayers.java | Updated constructors to accept and pass schedulerType parameter |
| LogitsQ8_0Layer.java, LogitsFP16Layer.java | Added conditional normalization task based on scheduler type |
| QuantizedLayerPlanner.java | Integrated SchedulerDetectionService to determine hardware type |
| All LayerPlanner files | Updated layer instantiation to pass schedulerType parameter |
| TransformerComputeKernelsLayered.java | Added new processHeadsParallel method for non-Nvidia hardware |
| set_paths, external/tornadovm | Updated TornadoVM submodule and adjusted path configuration |
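For the AbstractFFNLayers row, here is a rough sketch of how the new field and helper might fit together. The field and method names come from the table; the class shape, the nested enum, and the direction of the normalization condition are assumptions, since the summary does not spell them out.

```java
// Sketch under assumptions: only the schedulerType field and the
// shouldUseFinalNormalization() name are confirmed by the file summary.
public abstract class AbstractFFNLayersSketch {

    enum SchedulerType { NVIDIA, NON_NVIDIA }      // stand-in for the PR's real enum

    protected final SchedulerType schedulerType;   // handed down by the layer planners

    protected AbstractFFNLayersSketch(SchedulerType schedulerType) {
        this.schedulerType = schedulerType;
    }

    // Concrete FFN layers call this while building their task graph to decide
    // whether to append the final normalization task. The condition shown here
    // (normalize only on Nvidia) is an assumption, not the PR's exact rule.
    protected boolean shouldUseFinalNormalization() {
        return schedulerType == SchedulerType.NVIDIA;
    }
}
```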
} else {
    return unifiedLayer.task("parallel-attention", TransformerComputeKernelsLayered::processHeadsParallel,
            state.wrapQ, state.wrapKeyCache, state.wrapValueCache, state.wrapXb,
            config.numberOfHeads(), config.headSize(), config.kvDim(), config.kvMul(), config.contextLength(),
Copilot AI (Nov 7, 2025)
The contextLength() parameter appears twice in the method call on line 171. The parameter at position 8 is seqLen according to the method signature, but config.contextLength() is passed there, and contextLength is passed again at position 11. Verify whether seqLen should really be config.contextLength() or a different value, such as the current position.
Suggested change:
- config.numberOfHeads(), config.headSize(), config.kvDim(), config.kvMul(), config.contextLength(),
+ config.numberOfHeads(), config.headSize(), config.kvDim(), config.kvMul(), state.positionHolder,
src/main/java/org/beehive/gpullama3/tornadovm/layers/type/fp16/LlamaFP16FFNLayers.java
Pull Request Overview
Copilot reviewed 23 out of 23 changed files in this pull request and generated no new comments.
…THREAD_SCALE_FOR_LOGITS to 8. Adjust model loader logic for DEEPSEEK matching.
…/non-nvidia-hardware
  } else if (lowerName.contains("qwen3")) {
      return ModelType.QWEN_3;
- } else if (lowerName.contains("deepseek r1 distill")) {
+ } else if (lowerName.contains("deepseek")) {
I think we still need this.
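If the removed check is still needed, ordering matters: the more specific "deepseek r1 distill" match has to sit before the generic "deepseek" fallback, or the distilled models will never hit their own branch. A self-contained sketch of that ordering, in which every ModelType constant except QWEN_3 is hypothetical:

```java
public class ModelTypeMatchSketch {

    // Only QWEN_3 is confirmed by the diff above; the other constants are hypothetical.
    enum ModelType { QWEN_3, DEEPSEEK_R1_DISTILL, DEEPSEEK, UNKNOWN }

    static ModelType resolveModelType(String lowerName) {
        if (lowerName.contains("qwen3")) {
            return ModelType.QWEN_3;
        } else if (lowerName.contains("deepseek r1 distill")) {
            return ModelType.DEEPSEEK_R1_DISTILL;  // specific check first
        } else if (lowerName.contains("deepseek")) {
            return ModelType.DEEPSEEK;             // generic fallback
        }
        return ModelType.UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(resolveModelType("deepseek r1 distill qwen 7b")); // DEEPSEEK_R1_DISTILL
        System.out.println(resolveModelType("deepseek v2 lite"));            // DEEPSEEK
    }
}
```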
…te comment in AbstractLayer
Pull Request Overview
Copilot reviewed 24 out of 24 changed files in this pull request and generated 2 comments.
} else {
    return unifiedLayer.task("parallel-attention", TransformerComputeKernelsLayered::processHeadsParallel,
            state.wrapQ, state.wrapKeyCache, state.wrapValueCache, state.wrapXb,
            config.numberOfHeads(), config.headSize(), config.kvDim(), config.kvMul(), config.contextLength(),
Copilot AI (Nov 11, 2025)
The parameter layerIndex is passed twice in the processHeadsParallel call - once as the second-to-last parameter and once as the last parameter. According to the method signature in TransformerComputeKernelsLayered.java (line 282-283), the parameters should be: q, key_cache, value_cache, xb, nHeads, headSize, kvDim, kvMul, seqLen, positionHolder, wrapAtt, layer, contextLength. Here layerIndex appears to be passed as both layer and contextLength, which is incorrect.
Suggested change:
- config.numberOfHeads(), config.headSize(), config.kvDim(), config.kvMul(), config.contextLength(),
+ config.numberOfHeads(), config.headSize(), config.kvDim(), config.kvMul(),
return unifiedLayer.task("parallel-attention", TransformerComputeKernelsLayered::processHeadsParallel,
        state.wrapQ, state.wrapKeyCache, state.wrapValueCache, state.wrapXb,
        config.numberOfHeads(), config.headSize(), config.kvDim(), config.kvMul(), config.contextLength(),
        state.positionHolder, state.wrapAtt, layerIndex, config.contextLength());
Copilot AI (Nov 11, 2025)
The parameter layerIndex is passed twice in the processHeadsParallel call - once as the second-to-last parameter and once as the last parameter. According to the method signature in TransformerComputeKernelsLayered.java (line 282-283), the parameters should be: q, key_cache, value_cache, xb, nHeads, headSize, kvDim, kvMul, seqLen, positionHolder, wrapAtt, layer, contextLength. Here layerIndex appears to be passed as both layer and contextLength, which is incorrect.
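To make the positional mapping easier to check, here is a self-contained sketch that writes the quoted call out against the parameter order cited above. Buffer types are stubbed with plain arrays and the numeric values are illustrative; the real kernel in TransformerComputeKernelsLayered takes TornadoVM buffer types.

```java
public class ProcessHeadsParallelMappingSketch {

    // Parameter order as quoted by the review comment:
    // q, key_cache, value_cache, xb, nHeads, headSize, kvDim, kvMul,
    // seqLen, positionHolder, wrapAtt, layer, contextLength
    static void processHeadsParallel(float[] q, float[] keyCache, float[] valueCache, float[] xb,
                                     int nHeads, int headSize, int kvDim, int kvMul,
                                     int seqLen, int[] positionHolder, float[] wrapAtt,
                                     int layer, int contextLength) {
        // kernel body omitted in this sketch
    }

    public static void main(String[] args) {
        int contextLength = 2048, layerIndex = 3;   // illustrative values
        float[] q = {}, k = {}, v = {}, xb = {}, att = {};
        int[] positionHolder = {0};

        // Counting positions this way, the two occurrences of config.contextLength()
        // fill the seqLen and contextLength slots, and layerIndex fills layer; the
        // open question raised by the reviews is whether seqLen should really be the
        // full context length rather than the current position.
        processHeadsParallel(q, k, v, xb,
                /* nHeads */ 32, /* headSize */ 128, /* kvDim */ 1024, /* kvMul */ 4,
                /* seqLen */ contextLength, positionHolder, att,
                /* layer */ layerIndex, /* contextLength */ contextLength);
    }
}
```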
closed for #66