@v0i0 (Contributor) commented Nov 12, 2025

  (Batch, Heads, SeqLen, ChunkSize, Dhead, ExpandV)    eager-gbps    compile-gbps    fla-gbps    tilelang-gbps    helion_helion_gdn_fwd_h_tb-gbps
---------------------------------------------------  ------------  --------------  ----------  ---------------  ---------------------------------
                           (1, 6, 1024, 64, 256, 2)       3.92615         7.40171     373.228          261.995                            439.151
                                            average       3.92615         7.40171     373.228          261.995                            439.151

@v0i0 requested a review from yf225 on November 12, 2025
@tzj-fxz commented Nov 13, 2025

Hi! I have tested this kernel with some large-seqlen configs (32k seqlen with different head counts, typically (1, 32, 32768, 128, 1)). The autotuning process takes a long time and eventually settles on a best config, but the accuracy check reports zero. A minimal standalone repro of the check is sketched after the log below.

Module          FLOP    % Total
-----------  -------  ---------
Global       68.719B    100.00%
 - aten.bmm  68.719B    100.00%
(the same FLOP table is printed three times in the log)
[tritonbench] Output result csv to /tmp/tmp9y222nql.csv
(Batch, Heads, SeqLen, ChunkSize, Dhead, ExpandV) = (1, 32, 32768, 64, 128, 1); values rounded to three decimals:

Metric      eager    compile    helion_helion_gdn_fwd_h_tb
--------  -------  ---------  ----------------------------
speedup         -      0.815                       142.428
accuracy        -      0.000                         0.000
tflops      0.608      0.496                        86.596
gbps        7.756      6.321                      1104.643
latency   113.026    138.684                         0.794

(the CSV's `average` row repeats these values, since only one shape was run)
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
TritonBench accuracy check failed with Helion kernel config: @helion.kernel(config=helion.Config(block_sizes=[32], indexing=['tensor_descriptor', 'pointer', 'pointer', 'tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[4], load_eviction_policies=['last', 'last', 'first', '', 'first'], loop_orders=[[0, 1, 2]], num_stages=2, num_warps=4, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[False, False], range_num_stages=[1, 2], range_unroll_factors=[1, 1], range_warp_specializes=[]), static_shapes=True)
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
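
A minimal standalone version of the failing check might look like the sketch below. The input construction and the FLA entry point are assumptions on my side (layouts and signatures should be checked against the actual tritonbench operator), and `helion_gdn_fwd_h` stands in for the kernel under test:

```python
import torch
import torch.nn.functional as F

# The failing shape from the log above: (Batch, Heads, SeqLen, Dhead).
B, H, S, D = 1, 32, 32768, 128

# Assumed GDN-style inputs: unit-norm keys, log-space decay g, gate beta in (0, 1).
q = torch.randn(B, S, H, D, device="cuda", dtype=torch.bfloat16)
k = F.normalize(torch.randn(B, S, H, D, device="cuda", dtype=torch.bfloat16), dim=-1)
v = torch.randn(B, S, H, D, device="cuda", dtype=torch.bfloat16)
g = F.logsigmoid(torch.randn(B, S, H, device="cuda", dtype=torch.float32))
beta = torch.randn(B, S, H, device="cuda", dtype=torch.bfloat16).sigmoid()

# Assumed reference: FLA's chunked gated delta rule (check the actual signature).
from fla.ops.gated_delta_rule import chunk_gated_delta_rule
ref, _ = chunk_gated_delta_rule(q, k, v, g, beta)

out = helion_gdn_fwd_h(q, k, v, g, beta)  # hypothetical handle to the PR's kernel

# bf16 error grows with seqlen, so report a relative error instead of allclose.
rel = (out.float() - ref.float()).norm() / ref.float().norm()
print(f"relative error at seqlen={S}: {rel.item():.3e}")
```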

@v0i0 (Contributor, Author) commented Nov 13, 2025

Hey @tzj-fxz, thank you for checking it out! Initialization and error checking for this kernel seem tricky. Did you see similar issues with the other implementations in tritonbench, or with other shapes? I wonder whether a more robust way to generate the inputs would be to capture them out of an FLA run; a minimal sketch of that idea follows.
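
A minimal sketch of that capture idea, assuming nothing beyond `torch.save`/`torch.load` (the kernel entry point is hypothetical):

```python
import torch

def capture(path, **tensors):
    # Dump the exact tensors a known-good FLA run consumed.
    torch.save({name: t.detach().cpu() for name, t in tensors.items()}, path)

def replay(path, device="cuda"):
    # Reload them for the Helion kernel: bit-for-bit identical inputs.
    return {name: t.to(device) for name, t in torch.load(path).items()}

# In the FLA reference path:
#   capture("gdn_inputs.pt", q=q, k=k, v=v, g=g, beta=beta)
# In the Helion test:
#   out = helion_gdn_fwd_h(**replay("gdn_inputs.pt"))  # hypothetical entry point
```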

@tzj-fxz commented Nov 14, 2025

> Hey @tzj-fxz, thank you for checking it out! Initialization and error checking for this kernel seem tricky. Did you see similar issues with the other implementations in tritonbench, or with other shapes? I wonder whether a more robust way to generate the inputs would be to capture them out of an FLA run.

Yes. I have tested several seqlens (4k, 8k, 16k, 32k) with the other configs frozen, and I always get the same error message showing that the accuracy check failed. BTW the 1k and 2k cases run successfully. :)

@v0i0 (Contributor, Author) commented Nov 14, 2025

I see. I suspect this is just an issue with our chosen reference implementation; here is a run with a larger set of kernels. Note how FLA and Helion match exactly in their accuracy fields, passing and failing on the same shapes. A quick way to confirm this is sketched after the table.

  (Batch, Heads, SeqLen, ChunkSize, Dhead, ExpandV)    eager-gbps    compile-accuracy    compile-gbps    fla-accuracy    fla-gbps    tilelang-accuracy    tilelang-gbps    helion_helion_gdn_fwd_h_tb-accuracy    helion_helion_gdn_fwd_h_tb-gbps
---------------------------------------------------  ------------  ------------------  --------------  --------------  ----------  -------------------  ---------------  -------------------------------------  ---------------------------------
                           (1, 6, 1024, 64, 256, 2)       3.97343            1                7.32136             1       371.538                    0          261.161                                    1                              435.653
                           (1, 6, 2048, 64, 256, 2)       3.68888            0               10.1702              1       407.355                    0          281.227                                    1                              489.109
                           (1, 6, 4096, 64, 256, 2)       3.72543            0                8.47115             0       429.852                    0          293.247                                    0                              527.357
                          (16, 6, 1024, 64, 256, 2)      29.2867             0               40.8267              1       590.805                    0          628.67                                     1                              710.052
                          (16, 6, 2048, 64, 256, 2)      29.3871             0               17.4041              0       574.055                    0          648.2                                      0                              515.333
                          (16, 6, 4096, 64, 256, 2)      29.3519             0                9.8477              0       558.972                    0          649.403                                    0                              518.932
                                            average      16.5689             0.166667        15.6735              0.5     488.763                    0          460.318                                    0.5                            532.739
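
One rough way to confirm that the eager reference (rather than the kernels) drifts at long seqlen would be to compare the backends pairwise against an fp64 rerun. A minimal sketch, where `run_eager` / `run_fla` / `run_helion` are hypothetical wrappers around the three tritonbench backends:

```python
import torch

def rel_err(a, b):
    # Relative L2 error computed in float64, so the comparison adds no noise.
    a, b = a.double(), b.double()
    return ((a - b).norm() / b.norm()).item()

def compare(inputs, run_eager, run_fla, run_helion):
    # run_* are hypothetical wrappers around the three tritonbench backends;
    # ideally run_eager recomputes the reference in float64.
    out_eager = run_eager(*inputs)
    out_fla = run_fla(*inputs)
    out_helion = run_helion(*inputs)
    print("fla    vs eager:", rel_err(out_fla, out_eager))
    print("helion vs eager:", rel_err(out_helion, out_eager))
    print("helion vs fla  :", rel_err(out_helion, out_fla))
```

If helion-vs-fla stays small while both drift from eager as seqlen grows, the accuracy gate is flagging the reference, not the kernels.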
