
Conversation

Contributor

@akaashrp akaashrp commented Nov 2, 2025

Performance Comparison with v0.2.79: compared performance for the "canonical" flow, averaged across 20 runs:

  • No logit_bias
  • No logitProcessor
  • Frequency, presence, and repetition penalties applied
  • logprobs enabled
  • No top_logprobs

v0.2.79 performance: ~38.17 decode tokens/s
Post-PR performance: ~38.99 decode tokens/s (~2% faster)

Notes:

  1. The minimal performance improvement is likely due to kernel launch overhead. Specifically, sampling requires three kernel launches (fsoftmaxWithTemperature, fargsortProbs, fSampleWithTopP); a sketch of this sequence follows these notes.
  2. This approach will likely scale better when sampling from multiple sequences simultaneously.
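
For reference, a minimal, hypothetical sketch of that three-kernel sequence. The kernel names come from note 1 above; the wrapper function, its signature, and the DeviceTensor placeholder type are illustrative assumptions, not this PR's actual code.

// Hypothetical sketch: each call below corresponds to one GPU kernel launch,
// which is where the per-token launch overhead comes from.
type DeviceTensor = object; // stands in for a tvmjs device tensor
type GpuKernel = (...args: Array<DeviceTensor | number>) => DeviceTensor;

function sampleOnGpu(
  fsoftmaxWithTemperature: GpuKernel, // kernel 1: temperature-scaled softmax
  fargsortProbs: GpuKernel,           // kernel 2: sort probs in descending order
  fSampleWithTopP: GpuKernel,         // kernel 3: top-p sample from sorted probs
  logitsOnDevice: DeviceTensor,       // [1, vocabSize] logits already on the GPU
  temperature: number,
  top_p: number,
): DeviceTensor {
  const probs = fsoftmaxWithTemperature(logitsOnDevice, temperature);
  const sortedProbs = fargsortProbs(probs);
  // A single uniform random draw drives the top-p sampling kernel.
  return fSampleWithTopP(sortedProbs, top_p, Math.random());
}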

@akaashrp akaashrp requested a review from CharlieFRuan November 2, 2025 05:57
this.getTokenLogprob(sampledToken, top_logprobs!),
);
}
} else {
Member
Quick question: why can't we use the GPU sample kernel when logprobs is false?

Contributor Author

@akaashrp akaashrp Nov 10, 2025

IIRC, the flow for the logprobs=false case invokes _attach_multinomial_sampling_func / parallel_sampling_from_prob, which contains int8s that are not yet supported by WGSL / WebGPU. I experimented with enabling some experimental flags at the beginning of the relevant kernels, but I wasn't able to get them to work. One thing I haven't tried yet is replacing the int8s with another supported datatype at line 131 here: https://github.com/apache/tvm/blob/26db8bfd7e527198f43f3cc379f404c7513a82ef/python/tvm/relax/backend/gpu_generic/sampling.py#L131C1-L132C1.

Member

@CharlieFRuan CharlieFRuan Nov 11, 2025

I see. Ideally we could modify those kernels in TVM so they don't use int8s when the backend is WebGPU.

Let's leave a TODO at the start of this else { and somewhere in this PR's description.

I suppose the else branch is the more canonical code path, since local deployment rarely uses logprobs.

But this PR is great!
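
For illustration, a minimal sketch of where such a TODO could sit, assuming a simplified version of the if/else structure quoted below; the helper names here are hypothetical, not this PR's code.

// Hypothetical branch structure; sampleTokenOnGpu / sampleTokenOnCpu are
// stand-ins for the PR's actual GPU sampling path and CPU fallback.
declare function sampleTokenOnGpu(): number;
declare function sampleTokenOnCpu(): number;

function sampleToken(logprobs: boolean): number {
  if (logprobs) {
    // GPU path: fsoftmaxWithTemperature -> fargsortProbs -> fSampleWithTopP
    return sampleTokenOnGpu();
  } else {
    // TODO(webgpu): this branch falls back to CPU sampling because TVM's
    // parallel_sampling_from_prob kernel uses int8, which WGSL/WebGPU does
    // not support yet; switch to the GPU kernel once the TVM kernels avoid
    // int8 on the WebGPU backend.
    return sampleTokenOnCpu();
  }
}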

let sampledToken: number;
if (logprobs) {
let sampledTokensDevice: tvmjs.Tensor;
if (logprobs && _hasValue(top_p)) {
Member

Could you remind me why we add the _hasValue(top_p) check here? If a user wants logprobs but does not provide a top_p, execution would go to the else branch and thus not populate tokenLogprobArray.

Let's set top_p to 1.0 (the default value) at the start, when we pre-process the sampling parameters. Then we can remove this condition change.
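
A minimal sketch of the suggested pre-processing, assuming an illustrative SamplingParams shape (web-llm's actual generation config may differ):

// Illustrative parameter shape; the real config type in web-llm may differ.
interface SamplingParams {
  temperature?: number;
  top_p?: number;
}

function normalizeSamplingParams(p: SamplingParams): Required<SamplingParams> {
  return {
    temperature: p.temperature ?? 1.0,
    // With top_p defaulting to 1.0, top-p keeps the full distribution,
    // so the `logprobs && _hasValue(top_p)` guard can become just `logprobs`.
    top_p: p.top_p ?? 1.0,
  };
}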

@CharlieFRuan CharlieFRuan mentioned this pull request Nov 11, 2025