
Conversation


@oleksost (Contributor) commented Nov 26, 2025

✨ Description

Should be merged after GDN #392.

Adds the KDA mixer from Kimi Linear.

Note: for now this requires nightly Triton and PyTorch; see https://github.com/fla-org/flash-linear-attention/blob/main/FAQs.md.
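As a quick environment sanity check, something along these lines confirms the nightly builds are picked up (a minimal sketch, not part of the PR; the exact minimum versions the FLA kernels need are listed in the linked FAQ):

# Illustrative check only: the KDA kernels in flash-linear-attention currently
# expect a nightly PyTorch build and a recent Triton.
import torch
import triton

print("torch:", torch.__version__)    # expect a dev/nightly version string
print("triton:", triton.__version__)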

Added a new Dockerfile for the KDA image and uploaded it to registry.toolkit-sp.yul201.service-now.com/snow.research.afm/kda_image:kda_torch_nightly.

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

📝 Changes

  • Added kda.py to the SSM layers
  • Added KDA to the varlen test
  • Added hybrid_kda to the model configs used for testing
  • Added a Dockerfile for KDA

✅ Checklist

Make sure the following tasks are completed before submitting the PR:

General

  • 📜 I have read and followed the contributing guidelines.
  • 🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
  • 🎉 The functionality is complete, and I have tested the changes.
  • 📝 I have updated the documentation if needed.
  • ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
  • 🧩 I have commented my code, especially in hard-to-understand areas.

Dependencies and Configuration

  • 🐋 I have updated the Docker configuration or dependencies, if applicable.
  • 🔄 I have ensured compatibility with the existing setup after dependency changes.

Testing

  • 🧪 I have added or updated tests to cover my changes.
  • ✔️ New and existing tests pass locally with my changes.
  • 🚦 I have tested these changes on GPUs and verified training stability.
  • 🏋️ I have tested the changes on realistic training workloads, if applicable.

Performance Impact

  • 📊 I have run benchmarks where applicable to evaluate the performance impact.
  • ✅ The benchmarks show no performance regression.
  • 🚀 The benchmarks indicate a potential performance improvement.
  • ⚠️ The benchmarks indicate a potential performance degradation.
  • 📈 I have provided benchmark results and detailed any performance impact below, if applicable.

📊 Performance Impact Details

If there is any impact on performance, describe it and provide benchmark results, if applicable:


🗒️ Additional Notes

Include any additional context, information, or considerations here, such as known issues, follow-up tasks, or backward compatibility concerns.

@oleksost marked this pull request as ready for review November 26, 2025 20:40

@jlamypoirier (Collaborator) left a comment


Some comments, most also apply to GDN.

# The image is still compatible with any user id.
RUN useradd user
USER user
USER user

Unnecessary diff

super()._validate()


@config_class(dynamic_type={MixerConfig: "kda"})

"kimi_delta_attention"

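I.e., roughly the following registration (the config class name below is illustrative, not necessarily the one in the PR):

@config_class(dynamic_type={MixerConfig: "kimi_delta_attention"})
class KimiDeltaAttentionConfig(MixerConfig):  # illustrative class name
    ...
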
        desc="Configuration for the gated normalization applied to the KDA output.",
        hint=FieldHint.architecture,
    )
    q_projection_layer: AffineLinearConfig = Field(

projection seems unnecessary in these fields.
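For example, roughly (only the rename is the point; the Field pattern is taken from the quoted hunk, and the description text here is illustrative):

    q_layer: AffineLinearConfig = Field(  # instead of q_projection_layer
        desc="Configuration for the query projection.",
        hint=FieldHint.architecture,
    )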

    )

    @property
    def layer_class(self) -> "type":

type["KimiDeltaAttention"]
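I.e., a small sketch of the suggested annotation:

    @property
    def layer_class(self) -> type["KimiDeltaAttention"]:
        return KimiDeltaAttention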

        return KimiDeltaAttention

    def _validate(self) -> None:
        with self._set_implicit_default():

Not sure that's a good idea; it makes configs hard to understand. Better to assume the user specifies these explicitly (and most of the time we're creating from HF, so that's not a problem).
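For instance, the explicit alternative looks roughly like this (the field name is illustrative; the point is that nothing is derived inside _validate(), so the value has to come from the user or from the HF conversion):

    head_size: int = Field(  # illustrative field, no implicit default
        desc="KDA head size; must be set explicitly.",
        hint=FieldHint.architecture,
    )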



@pytest.mark.slow
@pytest.mark.skipif(not torch.cuda.is_available(), reason="KDA equivalence test needs CUDA")

pytest.mark.requires_cuda
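I.e., roughly (this assumes the repo-wide requires_cuda pytest marker; the test name is illustrative):

@pytest.mark.slow
@pytest.mark.requires_cuda
def test_kda_equivalence():  # illustrative name
    ...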

AprielHybridSSMConfig, KimiDeltaAttention = None, None


def _materialize_mixer_tensors(module: torch.nn.Module, distributed: Distributed, device: torch.device) -> None:

Please use get_stage, it already does this. See example here https://github.com/ServiceNow/Fast-LLM/blob/main/tests/layers/test_lm_head.py#L264

Also, please don't copy utils into every file; they can go in utils.
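A rough sketch of that direction (get_stage comes from the comment above, but the exact signature here is an assumption; see the linked test_lm_head.py for the real usage):

# Build the mixer through a stage so its tensors are materialized on the target
# device, instead of a hand-rolled _materialize_mixer_tensors helper.
stage = get_stage([mixer], distributed)  # signature is an assumption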

@pytest.mark.skipif(not torch.cuda.is_available(), reason="KDA equivalence test needs CUDA")
@pytest.mark.skipif(KimiDeltaAttention is None or AprielHybridSSMConfig is None, reason="Apriel KDA deps missing")
@pytest.mark.skipif(kda_module.chunk_kda is None, reason="KDA fused kernels not available")
def test_fast_llm_kda_matches_apriel_forward():

Not sure we need this test at all. test_huggingface_model already tests the equivalence

ModelTestingGroup.convert: ModelTestingGroupAction.normal,
ModelTestingGroup.generate: ModelTestingGroupAction.not_implemented,
ModelTestingGroup.megatron: ModelTestingGroupAction.not_implemented,
ModelTestingGroup.distributed: ModelTestingGroupAction.normal,

We might want to test once and then leave this as unimportant; it has a huge impact on testing time.
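I.e., something like the following (assuming unimportant is a valid ModelTestingGroupAction, as the comment suggests):

ModelTestingGroup.distributed: ModelTestingGroupAction.unimportant,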

tscholak added a commit that referenced this pull request Dec 7, 2025
Update to nvcr.io/nvidia/pytorch:25.11-py3 which includes:
- PyTorch 2.10
- CUDA 13.0
- flash-attn 2.7.4.post1 (pre-installed, no compilation needed)

Dependency updates:
- causal-conv1d: v1.5.4 (was pinned to commit 2a288a1)
- mamba-ssm: 2.2.6.post3 (was pinned to commit 4a8a2a2)
- flash-linear-attention: pin to commit 67eee20 (was @main)
- flash-attn: 2.7.4.post1 to match base image (was 2.7.3)
- triton: 3.5.1 in Dockerfile (was 3.1.0)

These updates enable Kimi Delta Attention (KDA) support via the
flash-linear-attention library. The pinned versions are tested and
working, unlike the nightly/unpinned approach in #395.

Note: Dropless MoE kernel remains broken with triton >= 3.2.0 and
needs a complete rewrite (also limited to 32 experts). This is
tracked separately and doesn't block KDA work.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
