
Conversation

@mikepapadim (Member)

No description provided.

Copilot AI (Contributor) left a comment


Pull Request Overview

This PR adds support for non-Nvidia hardware (specifically Apple Silicon) by introducing a scheduler detection system that adapts execution strategies based on the underlying hardware. The changes enable the application to select between Flash Attention (optimized for Nvidia GPUs) and parallel head processing (for non-Nvidia hardware like Apple Silicon), along with conditional normalization steps.

Key changes:

  • Introduced SchedulerType parameter across FFN layer constructors and logits layers to enable hardware-specific code paths
  • Added conditional execution logic for attention mechanisms and normalization based on detected hardware
  • Updated TornadoVM submodule reference to support non-Nvidia backends
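The dispatch described above can be sketched as follows. This is a hypothetical illustration modeled on names appearing in the PR (SchedulerType, shouldUseFinalNormalization(), the "parallel-attention" task); the enum values and task-name strings are assumptions, not the actual GPULlama3/TornadoVM code:

```java
// Hypothetical sketch of scheduler-driven dispatch; enum values and task-name
// strings are assumptions. Only SchedulerType and shouldUseFinalNormalization()
// are identifiers taken from the PR.
enum SchedulerType { NVIDIA, NON_NVIDIA }

class AttentionDispatch {
    // Flash Attention is selected only on Nvidia GPUs; other backends
    // (e.g. Apple Silicon) fall back to parallel per-head processing.
    static String attentionTask(SchedulerType scheduler) {
        return scheduler == SchedulerType.NVIDIA
                ? "flash-attention"
                : "parallel-attention";
    }

    // Mirrors the shouldUseFinalNormalization() helper: the non-Nvidia
    // path schedules an explicit extra normalization task.
    static boolean shouldUseFinalNormalization(SchedulerType scheduler) {
        return scheduler != SchedulerType.NVIDIA;
    }
}
```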

Reviewed Changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 2 comments.

File summary:

  • AbstractFFNLayers.java: Added schedulerType field and shouldUseFinalNormalization() helper method to support hardware-specific optimizations
  • LlamaQ8_0FFNLayers.java: Integrated scheduler-based attention configuration and conditional normalization tasks
  • LlamaFP16FFNLayers.java: Integrated scheduler-based attention configuration and conditional normalization tasks
  • Qwen3Q8_0FFNLayers.java, Qwen2Q8_0FFNLayers.java, Phi3Q8_0FFNLayers.java: Updated constructors to accept and pass schedulerType parameter
  • Qwen3FP16FFNLayers.java, Qwen2FP16FFNLayers.java, Phi3FP16FFNLayers.java: Updated constructors to accept and pass schedulerType parameter
  • LogitsQ8_0Layer.java, LogitsFP16Layer.java: Added conditional normalization task based on scheduler type
  • QuantizedLayerPlanner.java: Integrated SchedulerDetectionService to determine hardware type
  • All LayerPlanner files: Updated layer instantiation to pass schedulerType parameter
  • TransformerComputeKernelsLayered.java: Added new processHeadsParallel method for non-Nvidia hardware
  • set_paths, external/tornadovm: Updated TornadoVM submodule and adjusted path configuration


} else {
return unifiedLayer.task("parallel-attention", TransformerComputeKernelsLayered::processHeadsParallel,
state.wrapQ, state.wrapKeyCache, state.wrapValueCache, state.wrapXb,
config.numberOfHeads(), config.headSize(), config.kvDim(), config.kvMul(), config.contextLength(),

Copilot AI Nov 7, 2025


The contextLength() parameter appears twice in the method call on line 171. The parameter at position 8 is seqLen according to the method signature, but config.contextLength() is passed. Then contextLength is passed again at position 11. Verify that seqLen should be config.contextLength() rather than a different value like the current position.

Suggested change
config.numberOfHeads(), config.headSize(), config.kvDim(), config.kvMul(), config.contextLength(),
config.numberOfHeads(), config.headSize(), config.kvDim(), config.kvMul(), state.positionHolder,

Copilot AI review requested due to automatic review settings November 7, 2025 16:03
Copilot AI (Contributor) left a comment


Pull Request Overview

Copilot reviewed 23 out of 23 changed files in this pull request and generated no new comments.



…THREAD_SCALE_FOR_LOGITS to 8. Adjust model loader logic for DEEPSEEK matching.
} else if (lowerName.contains("qwen3")) {
return ModelType.QWEN_3;
} else if (lowerName.contains("deepseek r1 distill")) {
} else if (lowerName.contains("deepseek")) {
Collaborator left a comment

i think we still need this
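The ordering concern here is real: a lowercased name containing "deepseek r1 distill" also contains "deepseek", so the more specific check must come before the generic one or its branch is unreachable. A minimal sketch of the matching order shown in the diff (the ModelType values other than QWEN_3 and the return statements are assumptions, since the diff elides them):

```java
// Hypothetical sketch of the model-name matching order; DEEPSEEK_R1_DISTILL
// and UNKNOWN are assumed names, only QWEN_3 appears in the diff.
enum ModelType { QWEN_3, DEEPSEEK_R1_DISTILL, DEEPSEEK, UNKNOWN }

class ModelMatcher {
    static ModelType detect(String name) {
        String lowerName = name.toLowerCase();
        if (lowerName.contains("qwen3")) {
            return ModelType.QWEN_3;
        } else if (lowerName.contains("deepseek r1 distill")) {
            // Must precede the generic "deepseek" check, which would
            // otherwise shadow this branch for every distill model.
            return ModelType.DEEPSEEK_R1_DISTILL;
        } else if (lowerName.contains("deepseek")) {
            return ModelType.DEEPSEEK;
        }
        return ModelType.UNKNOWN;
    }
}
```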

Copilot AI review requested due to automatic review settings November 11, 2025 10:15
Copilot AI (Contributor) left a comment


Pull Request Overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated 2 comments.



} else {
return unifiedLayer.task("parallel-attention", TransformerComputeKernelsLayered::processHeadsParallel,
state.wrapQ, state.wrapKeyCache, state.wrapValueCache, state.wrapXb,
config.numberOfHeads(), config.headSize(), config.kvDim(), config.kvMul(), config.contextLength(),

Copilot AI Nov 11, 2025


The parameter layerIndex is passed twice in the processHeadsParallel call: once as the second-to-last parameter and once as the last parameter. According to the method signature in TransformerComputeKernelsLayered.java (line 282-283), the parameters should be: q, key_cache, value_cache, xb, nHeads, headSize, kvDim, kvMul, seqLen, positionHolder, wrapAtt, layer, contextLength. Here layerIndex appears to be passed as both layer and contextLength, which is incorrect.

Suggested change
config.numberOfHeads(), config.headSize(), config.kvDim(), config.kvMul(), config.contextLength(),
config.numberOfHeads(), config.headSize(), config.kvDim(), config.kvMul(),

return unifiedLayer.task("parallel-attention", TransformerComputeKernelsLayered::processHeadsParallel,
state.wrapQ, state.wrapKeyCache, state.wrapValueCache, state.wrapXb,
config.numberOfHeads(), config.headSize(), config.kvDim(), config.kvMul(), config.contextLength(),
state.positionHolder, state.wrapAtt, layerIndex, config.contextLength());

Copilot AI Nov 11, 2025


The parameter layerIndex is passed twice in the processHeadsParallel call: once as the second-to-last parameter and once as the last parameter. According to the method signature in TransformerComputeKernelsLayered.java (line 282-283), the parameters should be: q, key_cache, value_cache, xb, nHeads, headSize, kvDim, kvMul, seqLen, positionHolder, wrapAtt, layer, contextLength. Here layerIndex appears to be passed as both layer and contextLength, which is incorrect.

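Lining the call site up against the signature quoted in the review makes the mismatch easy to spot: positions 9 and 13 are both receiving a context-length-shaped value. A small sketch for checking such calls (the parameter names come from the review comment; the helper class itself is hypothetical):

```java
// Hypothetical helper for auditing the call site; the parameter list is
// taken verbatim from the processHeadsParallel signature quoted in the
// review: position 9 is seqLen, position 13 is contextLength.
class CallSiteCheck {
    static final String[] EXPECTED_PARAMS = {
        "q", "key_cache", "value_cache", "xb", "nHeads", "headSize",
        "kvDim", "kvMul", "seqLen", "positionHolder", "wrapAtt",
        "layer", "contextLength"
    };

    // Returns the expected parameter name for a 1-based argument position,
    // so each argument in the call can be matched against its slot; per the
    // review, the call passes config.contextLength() into slot 9 (seqLen)
    // and layerIndex into slot 13 (contextLength).
    static String paramAt(int oneBasedPosition) {
        return EXPECTED_PARAMS[oneBasedPosition - 1];
    }
}
```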
@mikepapadim (Member, Author)

closed for #66

3 participants