@v0i0 (Contributor) commented Nov 12, 2025

  (Batch, Heads, SeqLen, ChunkSize, Dhead, ExpandV)    eager-gbps    compile-gbps    fla-gbps    tilelang-gbps    helion_helion_gdn_fwd_h_tb-gbps
---------------------------------------------------  ------------  --------------  ----------  ---------------  ---------------------------------
                           (1, 6, 1024, 64, 256, 2)       3.92615         7.40171     373.228          261.995                            439.151
                                            average       3.92615         7.40171     373.228          261.995                            439.151

@v0i0 requested a review from yf225 on November 12, 2025
@tzj-fxz commented Nov 13, 2025

Hi! I have tested this kernel with some large-seqlen configs (32k seqlen with different head counts, typically (1, 32, 32768, 128, 1)). The autotuning process takes a long time and eventually settles on a best config, but the accuracy check reports zero. A minimal standalone repro of the check is sketched after the log below.

Module          FLOP    % Total
-----------  -------  ---------
Global       68.719B    100.00%
 - aten.bmm  68.719B    100.00%
(the same FLOP table is printed three times in the log)
[tritonbench] Output result csv to /tmp/tmp9y222nql.csv
(Batch, Heads, SeqLen, ChunkSize, Dhead, ExpandV) = (1, 32, 32768, 64, 128, 1); values rounded to three decimals:

Metric      eager    compile    helion_helion_gdn_fwd_h_tb
--------  -------  ---------  ----------------------------
speedup         -      0.815                       142.428
accuracy        -      0.000                         0.000
tflops      0.608      0.496                        86.596
gbps        7.756      6.321                      1104.643
latency   113.026    138.684                         0.794

(the CSV's `average` row repeats these values, since only one shape was run)
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
TritonBench accuracy check failed with Helion kernel config: @helion.kernel(config=helion.Config(block_sizes=[32], indexing=['tensor_descriptor', 'pointer', 'pointer', 'tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[4], load_eviction_policies=['last', 'last', 'first', '', 'first'], loop_orders=[[0, 1, 2]], num_stages=2, num_warps=4, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[False, False], range_num_stages=[1, 2], range_unroll_factors=[1, 1], range_warp_specializes=[]), static_shapes=True)
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
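
A minimal standalone version of the failing check might look like the sketch below. The input construction and the FLA entry point are assumptions on my side (layouts and signatures should be checked against the actual tritonbench operator), and `helion_gdn_fwd_h` stands in for the kernel under test:

```python
import torch
import torch.nn.functional as F

# The failing shape from the log above: (Batch, Heads, SeqLen, Dhead).
B, H, S, D = 1, 32, 32768, 128

# Assumed GDN-style inputs: unit-norm keys, log-space decay g, gate beta in (0, 1).
q = torch.randn(B, S, H, D, device="cuda", dtype=torch.bfloat16)
k = F.normalize(torch.randn(B, S, H, D, device="cuda", dtype=torch.bfloat16), dim=-1)
v = torch.randn(B, S, H, D, device="cuda", dtype=torch.bfloat16)
g = F.logsigmoid(torch.randn(B, S, H, device="cuda", dtype=torch.float32))
beta = torch.randn(B, S, H, device="cuda", dtype=torch.bfloat16).sigmoid()

# Assumed reference: FLA's chunked gated delta rule (check the actual signature).
from fla.ops.gated_delta_rule import chunk_gated_delta_rule
ref, _ = chunk_gated_delta_rule(q, k, v, g, beta)

out = helion_gdn_fwd_h(q, k, v, g, beta)  # hypothetical handle to the PR's kernel

# bf16 error grows with seqlen, so report a relative error instead of allclose.
rel = (out.float() - ref.float()).norm() / ref.float().norm()
print(f"relative error at seqlen={S}: {rel.item():.3e}")
```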

@v0i0 (Contributor, Author) commented Nov 13, 2025

Hey @tzj-fxz, thank you for checking it out! Initialization and error checking for this kernel seem tricky. Did you see similar issues with the other implementations in tritonbench, or with other shapes? I wonder whether a more robust way to generate the inputs would be to capture them out of an FLA run; a minimal sketch of that idea follows.
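
A minimal sketch of that capture idea, assuming nothing beyond `torch.save`/`torch.load` (the kernel entry point is hypothetical):

```python
import torch

def capture(path, **tensors):
    # Dump the exact tensors a known-good FLA run consumed.
    torch.save({name: t.detach().cpu() for name, t in tensors.items()}, path)

def replay(path, device="cuda"):
    # Reload them for the Helion kernel: bit-for-bit identical inputs.
    return {name: t.to(device) for name, t in torch.load(path).items()}

# In the FLA reference path:
#   capture("gdn_inputs.pt", q=q, k=k, v=v, g=g, beta=beta)
# In the Helion test:
#   out = helion_gdn_fwd_h(**replay("gdn_inputs.pt"))  # hypothetical entry point
```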

@tzj-fxz commented Nov 14, 2025

> Hey @tzj-fxz, thank you for checking it out! Initialization and error checking for this kernel seem tricky. Did you see similar issues with the other implementations in tritonbench, or with other shapes? I wonder whether a more robust way to generate the inputs would be to capture them out of an FLA run.

Yes. I have tested several seqlens (4k, 8k, 16k, 32k) with the other configs frozen, and I always get the same error message showing that the accuracy check failed. BTW the 1k and 2k cases run successfully. :)

@v0i0 (Contributor, Author) commented Nov 14, 2025

I see. I suspect this is just an issue with our chosen reference implementation; here is a run with a larger set of kernels. Note how FLA and Helion match exactly in their accuracy fields, passing and failing on the same shapes. A quick way to confirm this is sketched after the table.

  (Batch, Heads, SeqLen, ChunkSize, Dhead, ExpandV)    eager-gbps    compile-accuracy    compile-gbps    fla-accuracy    fla-gbps    tilelang-accuracy    tilelang-gbps    helion_helion_gdn_fwd_h_tb-accuracy    helion_helion_gdn_fwd_h_tb-gbps
---------------------------------------------------  ------------  ------------------  --------------  --------------  ----------  -------------------  ---------------  -------------------------------------  ---------------------------------
                           (1, 6, 1024, 64, 256, 2)       3.97343            1                7.32136             1       371.538                    0          261.161                                    1                              435.653
                           (1, 6, 2048, 64, 256, 2)       3.68888            0               10.1702              1       407.355                    0          281.227                                    1                              489.109
                           (1, 6, 4096, 64, 256, 2)       3.72543            0                8.47115             0       429.852                    0          293.247                                    0                              527.357
                          (16, 6, 1024, 64, 256, 2)      29.2867             0               40.8267              1       590.805                    0          628.67                                     1                              710.052
                          (16, 6, 2048, 64, 256, 2)      29.3871             0               17.4041              0       574.055                    0          648.2                                      0                              515.333
                          (16, 6, 4096, 64, 256, 2)      29.3519             0                9.8477              0       558.972                    0          649.403                                    0                              518.932
                                            average      16.5689             0.166667        15.6735              0.5     488.763                    0          460.318                                    0.5                            532.739
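
One rough way to confirm that the eager reference (rather than the kernels) drifts at long seqlen would be to compare the backends pairwise against an fp64 rerun. A minimal sketch, where `run_eager` / `run_fla` / `run_helion` are hypothetical wrappers around the three tritonbench backends:

```python
import torch

def rel_err(a, b):
    # Relative L2 error computed in float64, so the comparison adds no noise.
    a, b = a.double(), b.double()
    return ((a - b).norm() / b.norm()).item()

def compare(inputs, run_eager, run_fla, run_helion):
    # run_* are hypothetical wrappers around the three tritonbench backends;
    # ideally run_eager recomputes the reference in float64.
    out_eager = run_eager(*inputs)
    out_fla = run_fla(*inputs)
    out_helion = run_helion(*inputs)
    print("fla    vs eager:", rel_err(out_fla, out_eager))
    print("helion vs eager:", rel_err(out_helion, out_eager))
    print("helion vs fla  :", rel_err(out_helion, out_fla))
```

If helion-vs-fla stays small while both drift from eager as seqlen grows, the accuracy gate is flagging the reference, not the kernels.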
