activation-level distillation #388
base: main
Conversation
Great progress! Did you freeze everything except the randomly initialized mixers?
Resetting and distilling only one layer while freezing the rest of the model gives satisfactory results. Note that some changes were required to allow loading a pretrained model while freezing certain layers (#394).
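A minimal sketch of that setup in plain PyTorch, assuming a generic `model.layers[i].mixer` attribute path (the names are illustrative, not Fast-LLM's actual module layout):

```python
import torch

def freeze_all_but_mixer(model: torch.nn.Module, layer_index: int) -> None:
    """Freeze every parameter except the mixer of the chosen layer."""
    for param in model.parameters():
        param.requires_grad = False
    # `layers[...].mixer` is an illustrative attribute path, not Fast-LLM's actual one.
    for param in model.layers[layer_index].mixer.parameters():
        param.requires_grad = True

# Hand only the unfrozen parameters to the optimizer, e.g.:
# optimizer = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-4
# )
```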
    phase: PhaseType,
    iteration: int,
    metrics: dict | None = None,
    setup_activation_storage: bool = False,
Not needed, you can communicate through preprocessed meta kwargs.
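A rough sketch of that alternative, assuming the flag is attached to the kwargs dict during preprocessing (the key name and helper functions are hypothetical):

```python
# Hypothetical key name; the real one would live next to the other kwargs keys.
ACTIVATION_STORAGE_KEY = "activation_distillation_storage"

def preprocess(kwargs: dict) -> dict:
    # Decide during preprocessing whether activations should be captured,
    # instead of threading a `setup_activation_storage` flag through the call chain.
    kwargs[ACTIVATION_STORAGE_KEY] = {}
    return kwargs

def forward_step(kwargs: dict) -> None:
    storage = kwargs.get(ACTIVATION_STORAGE_KEY)
    if storage is not None:
        pass  # the teacher stores mixer activations here during its forward pass
```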
| ("model", "base_model", "head", "distillation_model"): "teacher", | ||
| ("reference_models"): { | ||
| "teacher": { | ||
| "model": { |
You'll need complete model descriptions, e.g. copied from another config; otherwise the created model will be too big.
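A sketch of what that could look like, assuming the teacher's description is copied from an existing full config (`student_model_config` and the override keys are hypothetical):

```python
import copy

# Start from a complete model description (e.g. the student's) and override
# only what differs for the teacher, instead of providing a partial description
# that falls back to default (oversized) dimensions.
teacher_model_config = copy.deepcopy(student_model_config)  # assumed to exist
teacher_model_config["pretrained"] = {"path": "/path/to/teacher/checkpoint"}  # illustrative override

reference_models = {"teacher": {"model": teacher_model_config}}
```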
    ModelTestingGroup.convert: ModelTestingGroupAction.unimportant,
    ModelTestingGroup.generate: ModelTestingGroupAction.unimportant,
    ModelTestingGroup.megatron: ModelTestingGroupAction.not_implemented,
    ModelTestingGroup.distributed: ModelTestingGroupAction.broken,  # failing: tp2, stp2, stp2_ce4
We'll probably want to leave these as unimportant and run them once in a while, because the testing suite can't really support many distributed runs.
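Concretely, the suggestion amounts to something like the following (the surrounding dict name is illustrative):

```python
model_testing_groups = {
    ModelTestingGroup.convert: ModelTestingGroupAction.unimportant,
    ModelTestingGroup.generate: ModelTestingGroupAction.unimportant,
    ModelTestingGroup.megatron: ModelTestingGroupAction.not_implemented,
    # Run occasionally on demand instead of marking the group as broken.
    ModelTestingGroup.distributed: ModelTestingGroupAction.unimportant,
}
```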
| """ | ||
| Maybe apply activation distillation loss and setup backward hooks | ||
| """ | ||
| mixer_output = hidden_states if bias is None else hidden_states + bias |
This should only be evaluated if needed.
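A minimal sketch of the lazy form being suggested, reusing the identifiers from the diff (the storage key written here is hypothetical):

```python
# Only form the (possibly bias-added) mixer output when a storage dict was
# actually provided, so the extra tensor add is skipped in normal training.
activation_storage = kwargs.get(BlockKwargs.activation_distillation_storage)
if activation_storage is not None:
    mixer_output = hidden_states if bias is None else hidden_states + bias
    activation_storage[self._block_name] = mixer_output  # key name is hypothetical
```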
    mixer_output = hidden_states if bias is None else hidden_states + bias
    # Teacher populates mixer activations for distillation.
    activation_storage = kwargs.get(BlockKwargs.activation_distillation_storage)
    if activation_storage is not None:
Consider using the new _debug / output_hidden_states interface instead? It does the exact same thing.
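For comparison, the same capture can be expressed with plain PyTorch forward hooks; this is a generic sketch, not the Fast-LLM `_debug` / `output_hidden_states` interface referenced above:

```python
import torch

def register_mixer_capture(mixer: torch.nn.Module, storage: dict, key: str):
    """Store the mixer's output in `storage` on every forward pass."""
    def hook(module, inputs, output):
        storage[key] = output.detach() if torch.is_tensor(output) else output
    return mixer.register_forward_hook(hook)

# Usage (attribute path is illustrative):
# storage = {}
# handle = register_mixer_capture(model.layers[0].mixer, storage, "layer_0")
# ...run the teacher forward pass...
# handle.remove()
```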
✨ Description
Closes #385
TODOs:
- Check that the loss goes to 0, and the gradients as well.

Sanity checks:
- Distilling the teacher into itself gives ~0 loss ✔️, but the loss then increases to a small value instead of staying at 0 (the 0-loss run is the orange curve).

With the caveat that distillation seems to experience memory spikes at specific points in training; the actual usage was lower most of the time.
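A hedged sketch of that self-distillation sanity check, using a simple MSE over matching activations (the loss choice and dict layout are illustrative, not necessarily what the PR implements):

```python
import torch
import torch.nn.functional as F

def activation_distillation_loss(
    student_acts: dict[str, torch.Tensor],
    teacher_acts: dict[str, torch.Tensor],
) -> torch.Tensor:
    """Mean MSE over matching student/teacher activations."""
    losses = [F.mse_loss(student_acts[name], teacher_acts[name]) for name in student_acts]
    return torch.stack(losses).mean()

# Self-distillation sanity check: identical activations should give exactly 0 loss.
acts = {"layer_0": torch.randn(2, 8, 16)}
assert activation_distillation_loss(acts, acts).item() == 0.0
```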