-
Notifications
You must be signed in to change notification settings - Fork 617
[FLUX] Add FLUX inference test in CI #1969
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| # Create mapping from local indices to global prompt indices | ||
| global_ids = list(range(global_rank, total_prompts, world_size)) | ||
| if single_prompt_mode: | ||
| # In single prompt mode, all ranks process the same prompt (index 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I somehow feel this PR is doing some overkill stuff. Would it be simpler if we just provide a truncated version of prompts.txt in test assets, say 1/2 prompts on each rank? Our goal is to make CI job lighter. Users should be fine if they always have to specify prompts in a .txt file? I just feel dealing with two paths doesn't seem to be elegant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! That would be easier. I think specify prompts in a .txt file is easier, I will remove the single prompt path. However, there are a minor bug to fix: If the number of prompts < number of ranks, some rank will get 0 prompts. During forward path of T5/clips encoder, the program will hang because FSDP is applied on encoder, and some ranks didn't run forward so all_gather will hang.
I will modify the PR to fix the bug and always using prompts.txt file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good! I think we can just error out early in that case, for simplicity.
882f29e to
8205bfd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of removing all of them, we can create a test asset with only small portion of it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using smaller validation set for integration test
where are you doing this?
also you mentioned some difficulties when combining CP with validation, is it no longer a blocker?
Here I updated the validation.steps = 5 (default is 48 , which is calculated from the validation dataset size)
The error message is NCCL timeout during generating images, then I figure out it happens occasionally in both CI docker and local dev machine. So I thought it could because each rank takes different inputs thus different generation speed. |
tianyu-l
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sgtm
Model test time breakdown:

Further time breakdown load the "flux validation test" is:
In this PR, following changes are being made to save time: