
Commit 034828c

cleanup
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
1 parent a74fa01 commit 034828c

File tree

1 file changed

+3 -4 lines changed

_posts/2025-11-27-improved-cuda-debugging.md

Lines changed: 3 additions & 4 deletions
@@ -327,7 +327,7 @@ result = ops.cutlass_scaled_mm(
 print(result)
 ```
 
-Following the same steps as before we first rebuild vLLM with lineinfo; If vLLM was installed via an editable install (i.e. `-e .`) this can be done using:
+Following the same steps as before we first rebuild vLLM with lineinfo; if vLLM was installed via an editable install (i.e. `-e .`) this can be done using:
 
 ```bash
 NVCC_PREPEND_FLAGS="-lineinfo" python setup.py build_ext --inplace
@@ -404,7 +404,7 @@ This reveals a deep inline call chain:
 /*7f5687bbb580*/ UTMALDG.3D [UR8], [UR14], desc[UR16] ;
 ```
 
-Now we can trace the issue back through the full call chain — from ptx instruction we saw before all the way up to where it is instantiated in vLLM. Following the call chain we can get to a contextually useful line, in this case that is in CUTLASS's collective mainloop (`sm90_mma_tma_gmma_ss_warpspecialized.hpp`):
+Now we can trace the issue back through the full call chain — from ptx instruction we saw before all the way to the device_kernel entry point. Following the call chain we can get to a contextually useful line, in this case that is in CUTLASS's collective mainloop (`sm90_mma_tma_gmma_ss_warpspecialized.hpp`):
 ```c++
 copy(mainloop_params.tma_load_a.with(*tma_barrier, mcast_mask_a), tAgA(_,_,_,*k_tile_iter), tAsA(_,_,_,write_stage));
 ```
@@ -416,8 +416,7 @@ This is more helpful as it informs us the issue is with loading the A matrix spe
 
 ## Conclusion
 
-This blog post introduced two advanced debugging techniques for CUDA kernels. The first technique uses user-triggered core dumps to identify hanging kernels, while the second traces complex kernels back to their source code by leveraging line information embedded in the compiled binary. These techniques are powerful tools for debugging complex issues in CUDA kernels, especially illegal memory access problems.
-Using both the `user induced GPU core dump generation` and `nvdisasm` techniques we were able to recently debug a hard-to-reproduce and tricky hang in the CUTLASS MLA attention backend: https://github.com/vllm-project/vllm/pull/26026 (this bug actually stemmed from the upstream CUTLASS code example and has since been fixed in [v4.3.0](https://github.com/NVIDIA/cutlass/commit/b1d6e2c9b334dfa811e4183dfbd02419249e4b52)).
+This blog post introduced two advanced debugging techniques for CUDA kernels. The first technique uses user-triggered core dumps to identify hanging kernels, while the second traces complex kernels back to their source code by leveraging line information embedded in the compiled binary. These techniques are powerful tools for debugging complex issues in CUDA kernels, especially illegal memory access problems. Using both in tandem we were able to recently debug a hard-to-reproduce and tricky hang in the CUTLASS MLA attention backend: https://github.com/vllm-project/vllm/pull/26026 (this bug actually stemmed from the upstream CUTLASS code example and has since been fixed in [v4.3.0](https://github.com/NVIDIA/cutlass/commit/b1d6e2c9b334dfa811e4183dfbd02419249e4b52)).
 
 The vLLM project aims to provide easy, fast, stable, and affordable LLM serving for everyone, and accessible debugging is an important aspect of this mission. We will continue to share more debugging tips and techniques in the future to build a strong LLM inference ecosystem together. To share your story or usage with vLLM, please submit a PR at [the blogpost repository](https://github.com/vllm-project/vllm-project.github.io).
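For readers following the lineinfo workflow that this change touches, here is a rough sketch of how the rebuilt extension might be inspected with `cuobjdump` and `nvdisasm`. The shared-object name and extracted cubin name are placeholders, and the exact flags can vary across CUDA versions; consult the CUDA Binary Utilities documentation for the authoritative options.

```bash
# Rebuild vLLM's CUDA extensions with line info, as shown in the diff above
# (editable install assumed).
NVCC_PREPEND_FLAGS="-lineinfo" python setup.py build_ext --inplace

# Extract the embedded cubins from the built extension so nvdisasm can read
# them (the .so name below is a placeholder for the actual extension module).
cuobjdump -xelf all vllm/_C.abi3.so

# Annotate the disassembly with source line info; the inline line-info option
# is what lets us walk from a SASS/PTX instruction back up the inlined call
# chain to a useful source line such as the CUTLASS collective mainloop.
nvdisasm -gi extracted_kernel.cubin | less
```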

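The conclusion also refers back to the user-triggered GPU core dump technique covered earlier in the post. As a rough reminder of how that flow can look in practice (the paths, pipe name, and workload script below are placeholder assumptions; the CUDA-GDB documentation describes the exact trigger and naming behavior):

```bash
# Enable user-triggered GPU core dumps for the process before launch.
export CUDA_ENABLE_USER_TRIGGERED_COREDUMP=1
export CUDA_COREDUMP_PIPE=/tmp/cuda_corepipe      # placeholder pipe path
export CUDA_COREDUMP_FILE=/tmp/hang.cudacore      # placeholder dump path
python repro_workload.py &                        # placeholder workload

# When the kernel appears hung, write to the pipe to request a dump of the
# GPU state without killing the process.
echo 1 > /tmp/cuda_corepipe

# Load the resulting core dump in cuda-gdb to see which kernels are resident
# and where their warps are stopped.
cuda-gdb -ex "target cudacore /tmp/hang.cudacore"
```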