
RTX 2060 Qwen Image Edit black output *never used sageattention* #10668

@realisticdreamer114514

Description


Custom Node Testing

Expected Behavior

The image should produce a normal preview from step 2 of inference, and a normal output should be saved after VAE decode.

Actual Behavior

The preview is black throughout inference, and only a black output is generated.

Steps to Reproduce

  1. Install the calcuis/gguf and city96/ComfyUI-GGUF extensions for GGUF loading, and pollockjj/ComfyUI-MultiGPU for manual offloading. The offloading extension is necessary to run inference on my device, as disabling it consistently causes OOM. The GGUF files could likely be replaced by safetensors on another device and the issue would still reproduce, since the text encoder is unlikely to be at fault
  2. Obtain the model files: diffusion model, text encoder, text encoder mmproj, VAE, lightning LoRA
  3. python main.py --preview-method auto --force-fp16 --disable-cuda-malloc --windows-standalone-build
  4. Run the workflow from the screencap (I avoid uploading images with metadata or JSON):
[workflow screenshot]

Debug Logs

Console (cmd in conda environment):

(comfyui) PS D:\webui-forge\ComfyUI_windows_portable\ComfyUI> .\run

D:\webui-forge\ComfyUI_windows_portable\ComfyUI>python main.py --preview-method auto --force-fp16 --disable-cuda-malloc --windows-standalone-build
Adding extra search path checkpoints D:\webui-forge\ComfyUI_windows_portable\ComfyUI\path\to\stable-diffusion-webui\models\Stable-diffusion
Adding extra search path configs D:\webui-forge\ComfyUI_windows_portable\ComfyUI\path\to\stable-diffusion-webui\models\Stable-diffusion
Adding extra search path vae D:\webui-forge\ComfyUI_windows_portable\ComfyUI\path\to\stable-diffusion-webui\models\VAE
Adding extra search path loras D:\webui-forge\ComfyUI_windows_portable\ComfyUI\path\to\stable-diffusion-webui\models\Lora
Adding extra search path loras D:\webui-forge\ComfyUI_windows_portable\ComfyUI\path\to\stable-diffusion-webui\models\LyCORIS
Adding extra search path upscale_models D:\webui-forge\ComfyUI_windows_portable\ComfyUI\path\to\stable-diffusion-webui\models\ESRGAN
Adding extra search path upscale_models D:\webui-forge\ComfyUI_windows_portable\ComfyUI\path\to\stable-diffusion-webui\models\RealESRGAN
Adding extra search path upscale_models D:\webui-forge\ComfyUI_windows_portable\ComfyUI\path\to\stable-diffusion-webui\models\SwinIR
Adding extra search path embeddings D:\webui-forge\ComfyUI_windows_portable\ComfyUI\path\to\stable-diffusion-webui\embeddings
Adding extra search path hypernetworks D:\webui-forge\ComfyUI_windows_portable\ComfyUI\path\to\stable-diffusion-webui\models\hypernetworks
Adding extra search path controlnet D:\webui-forge\ComfyUI_windows_portable\ComfyUI\path\to\stable-diffusion-webui\models\ControlNet
Adding extra search path checkpoints D:\webui-forge\webui\models\Stable-diffusion
Adding extra search path configs D:\webui-forge\webui\models\Stable-diffusion
Adding extra search path vae D:\webui-forge\webui\models\VAE
Adding extra search path loras D:\webui-forge\webui\models\Lora
Adding extra search path loras D:\webui-forge\webui\models\LyCORIS
Adding extra search path upscale_models D:\webui-forge\webui\models\ESRGAN
Adding extra search path upscale_models D:\webui-forge\webui\models\RealESRGAN
Adding extra search path upscale_models D:\webui-forge\webui\models\SwinIR
Adding extra search path embeddings D:\webui-forge\webui\embeddings
Adding extra search path hypernetworks D:\webui-forge\webui\models\hypernetworks
Adding extra search path controlnet D:\webui-forge\webui\models\ControlNet
Adding extra search path clip D:\webui-forge\webui\models\text_encoder

Checkpoint files will always be loaded safely.
Total VRAM 6144 MB, total RAM 32605 MB
pytorch version: 2.8.0+cu126
xformers version: 0.0.32.post2
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 2060 : native
Enabled pinned memory 14672.0
Using xformers attention
Python version: 3.11.13 | packaged by Anaconda, Inc. | (main, Jun  5 2025, 13:03:15) [MSC v.1929 64 bit (AMD64)]
ComfyUI version: 0.3.68
ComfyUI frontend version: 1.28.8
[Prompt Server] web root: C:\Users\USER\anaconda3\envs\comfyui\Lib\site-packages\comfyui_frontend_package\static
2025-11-06 18:27:17.392247: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
From C:\Users\USER\anaconda3\envs\comfyui\Lib\site-packages\tf_keras\src\losses.py:2976: The name tf.losses.sparse_softmax_cross_entropy is deprecated. Please use tf.compat.v1.losses.sparse_softmax_cross_entropy instead.

Unable to parse pyproject.toml due to lack dependency pydantic-settings, please run 'pip install -r requirements.txt': Illegal character '\n' (at line 3, column 101)
ComfyUI-GGUF: Allowing full torch compile
[MultiGPU Core Patching] Patching mm.soft_empty_cache for Comprehensive Memory Management (VRAM + CPU + Store Pruning)
[MultiGPU Core Patching] Patching mm.get_torch_device, mm.text_encoder_device, mm.unet_offload_device
[MultiGPU DEBUG] Initial current_device: cuda:0
[MultiGPU DEBUG] Initial current_text_encoder_device: cuda:0
[MultiGPU DEBUG] Initial current_unet_offload_device: cpu
[MultiGPU] Initiating custom_node Registration. . .
-----------------------------------------------
custom_node                   Found     Nodes
-----------------------------------------------
ComfyUI-LTXVideo                  N         0
ComfyUI-Florence2                 N         0
ComfyUI_bitsandbytes_NF4          N         0
x-flux-comfyui                    N         0
ComfyUI-MMAudio                   N         0
ComfyUI-GGUF                      Y        18
PuLID_ComfyUI                     N         0
ComfyUI-WanVideoWrapper           N         0
-----------------------------------------------
[MultiGPU] Registration complete. Final mappings: CheckpointLoaderAdvancedMultiGPU, CheckpointLoaderAdvancedDisTorch2MultiGPU, UNetLoaderLP, UNETLoaderMultiGPU, VAELoaderMultiGPU, CLIPLoaderMultiGPU, DualCLIPLoaderMultiGPU, TripleCLIPLoaderMultiGPU, QuadrupleCLIPLoaderMultiGPU, CLIPVisionLoaderMultiGPU, CheckpointLoaderSimpleMultiGPU, ControlNetLoaderMultiGPU, DiffusersLoaderMultiGPU, DiffControlNetLoaderMultiGPU, UNETLoaderDisTorch2MultiGPU, VAELoaderDisTorch2MultiGPU, CLIPLoaderDisTorch2MultiGPU, DualCLIPLoaderDisTorch2MultiGPU, TripleCLIPLoaderDisTorch2MultiGPU, QuadrupleCLIPLoaderDisTorch2MultiGPU, CLIPVisionLoaderDisTorch2MultiGPU, CheckpointLoaderSimpleDisTorch2MultiGPU, ControlNetLoaderDisTorch2MultiGPU, DiffusersLoaderDisTorch2MultiGPU, DiffControlNetLoaderDisTorch2MultiGPU, UnetLoaderGGUFDisTorchMultiGPU, UnetLoaderGGUFAdvancedDisTorchMultiGPU, CLIPLoaderGGUFDisTorchMultiGPU, DualCLIPLoaderGGUFDisTorchMultiGPU, TripleCLIPLoaderGGUFDisTorchMultiGPU, QuadrupleCLIPLoaderGGUFDisTorchMultiGPU, UnetLoaderGGUFDisTorch2MultiGPU, UnetLoaderGGUFAdvancedDisTorch2MultiGPU, CLIPLoaderGGUFDisTorch2MultiGPU, DualCLIPLoaderGGUFDisTorch2MultiGPU, TripleCLIPLoaderGGUFDisTorch2MultiGPU, QuadrupleCLIPLoaderGGUFDisTorch2MultiGPU, UnetLoaderGGUFMultiGPU, UnetLoaderGGUFAdvancedMultiGPU, CLIPLoaderGGUFMultiGPU, DualCLIPLoaderGGUFMultiGPU, TripleCLIPLoaderGGUFMultiGPU, QuadrupleCLIPLoaderGGUFMultiGPU
Nvidia APEX normalization not installed, using PyTorch LayerNorm

Import times for custom nodes:
   0.0 seconds: D:\webui-forge\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-GGUF
   0.1 seconds: D:\webui-forge\ComfyUI_windows_portable\ComfyUI\custom_nodes\gguf
   0.1 seconds: D:\webui-forge\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-MultiGPU

Context impl SQLiteImpl.
Will assume non-transactional DDL.
No target revision found.
Starting server

To see the GUI go to: http://127.0.0.1:8188
got prompt
Using xformers attention in VAE
Using xformers attention in VAE
VAE load device: cuda:0, offload device: cpu, dtype: torch.float16
gguf qtypes: Q6_K (1), F32 (141), IQ4_XS (166), Q5_K (31)
Using mmproj 'mmproj-qwen2.5-vl-7b-it-q4_0.gguf' for 'qwen2.5-vl-7b-it-iq4_xs.gguf'.
gguf qtypes: Q4_0 (192), F32 (291), F16 (34), Q8_0 (2)
[MultiGPU Core Patching] text_encoder_device_patched returning device: cuda:0 (current_text_encoder_device=cuda:0)
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load WanVAE
0 models unloaded.
loaded partially; 128.00 MB usable, 128.00 MB loaded, 114.00 MB offloaded, lowvram patches: 0
Requested to load QwenImageTEModel_
loaded completely; 3541.80 MB usable, 4469.60 MB loaded, full load: True
[MultiGPU Core Patching] Successfully patched ModelPatcher.partially_load
gguf qtypes: F16 (1093), Q8_0 (30), Q4_K (692), Q5_K (118)
model weight dtype torch.float16, manual cast: None
model_type FLUX
[MultiGPU DisTorch V2] Full allocation string: #cuda:0;8.0;cpu
[MultiGPU DisTorch V2] GGUFModelPatcher missing 'model_patches_models' attribute, using 'model_patches_to' fallback.
Requested to load QwenImage
===============================================
    DisTorch2 Model Virtual VRAM Analysis
===============================================
Object   Role   Original(GB) Total(GB)  Virt(GB)
-----------------------------------------------
cuda:0   recip       6.00GB   14.00GB   +8.00GB
cpu      donor      31.84GB   23.84GB   -8.00GB
-----------------------------------------------
model    model      11.62GB    3.62GB   -8.00GB


[MultiGPU DisTorch V2] Model size (11.62GB) is larger than 90% of available VRAM on: cuda:0 (5.40GB).
[MultiGPU DisTorch V2] To prevent an OOM error, set 'virtual_vram_gb' to at least 6.22.


==================================================
[MultiGPU DisTorch V2] Final Allocation String:
cuda:0,0.6040;cpu,0.2512
==================================================
    DisTorch2 Model Device Allocations
==================================================
Device    VRAM GB    Dev %   Model GB    Dist %
--------------------------------------------------
cuda:0       6.00    60.4%       3.62     31.2%
cpu         31.84    25.1%       8.00     68.8%
--------------------------------------------------
    DisTorch2 Model Layer Distribution
--------------------------------------------------
Layer Type         Layers   Memory (MB)   % Total
--------------------------------------------------
Linear                846      12010.58    100.0%
RMSNorm               241          0.07      0.0%
LayerNorm             241          0.00      0.0%
--------------------------------------------------
DisTorch2 Model Final Device/Layer Assignments
--------------------------------------------------
Device             Layers   Memory (MB)   % Total
--------------------------------------------------
cuda:0 (<0.01%)       484          0.82      0.0%
cuda:0                264       3874.96     32.3%
cpu                   580       8134.87     67.7%
--------------------------------------------------
[MultiGPU DisTorch V2] DisTorch loading completed.
[MultiGPU DisTorch V2] Total memory: 12010.65MB
100%|████████████████████████████████████████████████████████████████████████| 3/3 [01:23<00:00, 27.73s/it]
[MultiGPU DisTorch V2] ModelPatcher missing 'model_patches_models' attribute, using 'model_patches_to' fallback.
Requested to load WanVAE
0 models unloaded.
loaded partially; 128.00 MB usable, 128.00 MB loaded, 114.00 MB offloaded, lowvram patches: 0
Prompt executed in 399.30 seconds

Other

This is not a duplicate of previous similar bugs that have a workaround:

  • My GPU is too old to support sageattention, so that incompatibility is ruled out (many threads here and on Reddit point to it as the cause of a bug with identical output)
  • I never added --fast during inference and testing
  • I tested these flag combinations: --force-fp16; --force-fp16 --fp32-unet; --force-fp16 --fp32-vae. Only --force-fp16 --fp32-unet produced a normal preview and output, which implies the issue likely lies in the Qwen diffusion model's fp16 inference
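If the fp16 hypothesis above is right, the usual mechanism is numeric overflow: fp16 tops out at 65504, so a large intermediate activation overflows to inf, inf arithmetic then yields NaN, and NaN latents decode/clamp to a uniform black image. A minimal NumPy sketch of that failure mode (an illustration of the hypothesis, not taken from the Qwen model itself):

```python
import numpy as np

# fp16 max is 65504; any intermediate activation beyond that overflows.
print(np.finfo(np.float16).max)  # 65504.0

x = np.float16(300.0)
y = x * x            # 90000 > 65504 -> overflows to inf in fp16
print(y)             # inf

# inf - inf (e.g. inside a softmax or normalization) yields NaN,
# and NaN propagates through every later layer.
print(y - y)         # nan

# The same product in fp32 is unremarkable, matching the --fp32-unet result.
print(np.float32(300.0) * np.float32(300.0))  # 90000.0
```

This is why forcing only the UNet to fp32 is enough to fix the output while the VAE and text encoder stay in fp16.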

As seen in the successful test run, the workaround is --fp32-unet, at a tremendous speed cost (~2-5x slower).
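For reference, assuming the only change from step 3 above is adding --fp32-unet, the full launch line for the working run would be:

```shell
# Same flags as the failing run, with --fp32-unet added so the diffusion
# model runs in fp32 while everything else stays fp16.
python main.py --preview-method auto --force-fp16 --fp32-unet --disable-cuda-malloc --windows-standalone-build
```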

This may be an issue with the Qwen model's fp16 inference, with ComfyUI's implementation of fp16 inference, with quantization (possibly from either fp32 or bf16 source weights), or an fp16 issue specific to RTX 20 series cards; it needs further narrowing down.
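To help narrow it down, it would be useful to distinguish "all NaN" from "all zeros" in the decoded output, since both render as a black preview. A hypothetical diagnostic helper (the name diagnose_tensor and its placement after VAE decode are my own; a torch tensor would be converted first with .float().cpu().numpy()):

```python
import numpy as np

def diagnose_tensor(name, arr):
    """Report NaN/inf counts and the finite value range of an array.

    Returns True if the array is entirely finite. Running this on the
    latents (per step) and on the decoded image would show where the
    values first blow up.
    """
    nan = int(np.isnan(arr).sum())
    inf = int(np.isinf(arr).sum())
    finite = arr[np.isfinite(arr)]
    lo, hi = (finite.min(), finite.max()) if finite.size else (float("nan"), float("nan"))
    print(f"{name}: nan={nan} inf={inf} finite range=[{lo}, {hi}]")
    return nan == 0 and inf == 0

# An all-NaN "decoded image", i.e. what overflowed fp16 inference would produce:
bad = np.full((4, 4, 3), np.nan, dtype=np.float32)
diagnose_tensor("decoded", bad)  # returns False
```

If the report shows NaN already in the step-2 latents, that would point at the diffusion model's fp16 path rather than the VAE.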

(Apologies for having to include some extensions to deal with hardware constraints; someone might be able to test on an RTX 20 series card with more VRAM.)


Labels

Potential Bug (User is reporting a bug. This should be tested.)
