Skip to content

Conversation

@wwwjn
Copy link
Contributor

@wwwjn wwwjn commented Oct 30, 2025

Model test time breakdown:
image

Further time breakdown load the "flux validation test" is:

  • model initialization takes 2-3min (Flux model + T5-xxl encoder, CLIPs encoder)
  • validation run (validation steps=default, each step validates the model, generate image at step 1,5,10): 5min30s

In this PR, following changes are being made to save time:

  1. merge FLUX related test into one time to save FLUX model setup time
  2. Reduce validation step to be 5, using smaller validation set for integration test

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Oct 30, 2025
@wwwjn wwwjn changed the title [FLUXAdd single prompt mode for FLUX inference and inference CI test [FLUX] Add single prompt mode for FLUX inference and inference CI test Oct 30, 2025
# Create mapping from local indices to global prompt indices
global_ids = list(range(global_rank, total_prompts, world_size))
if single_prompt_mode:
# In single prompt mode, all ranks process the same prompt (index 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I somehow feel this PR is doing some overkill stuff. Would it be simpler if we just provide a truncated version of prompts.txt in test assets, say 1/2 prompts on each rank? Our goal is to make CI job lighter. Users should be fine if they always have to specify prompts in a .txt file? I just feel dealing with two paths doesn't seem to be elegant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! That would be easier. I think specify prompts in a .txt file is easier, I will remove the single prompt path. However, there are a minor bug to fix: If the number of prompts < number of ranks, some rank will get 0 prompts. During forward path of T5/clips encoder, the program will hang because FSDP is applied on encoder, and some ranks didn't run forward so all_gather will hang.

I will modify the PR to fix the bug and always using prompts.txt file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good! I think we can just error out early in that case, for simplicity.

@wwwjn wwwjn force-pushed the flux-tests branch 2 times, most recently from 882f29e to 8205bfd Compare November 19, 2025 06:51
@wwwjn wwwjn changed the title [FLUX] Add single prompt mode for FLUX inference and inference CI test [FLUX] Add FLUX inference test in CI Nov 19, 2025
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of removing all of them, we can create a test asset with only small portion of it.

@wwwjn wwwjn requested a review from tianyu-l November 25, 2025 14:45
Copy link
Contributor

@tianyu-l tianyu-l left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using smaller validation set for integration test

where are you doing this?

also you mentioned some difficulties when combining CP with validation, is it no longer a blocker?

@wwwjn
Copy link
Contributor Author

wwwjn commented Nov 25, 2025

using smaller validation set for integration test

where are you doing this?

Here I updated the validation.steps = 5 (default is 48 , which is calculated from the validation dataset size)

also you mentioned some difficulties when combining CP with validation, is it no longer a blocker?

The error message is NCCL timeout during generating images, then I figure out it happens occasionally in both CI docker and local dev machine. So I thought it could because each rank takes different inputs thus different generation speed.

Copy link
Contributor

@tianyu-l tianyu-l left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sgtm

@wwwjn wwwjn merged commit cbdb311 into main Nov 25, 2025
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants