@jorgee
Contributor

@jorgee jorgee commented Oct 14, 2025

Summary

This PR unifies exit code handling behavior across all cloud provider executors (AWS Batch, Azure Batch, Google Batch, and Kubernetes). Previously, different executors had inconsistent approaches to obtaining task exit codes, which led to issues like missing Fusion exit codes (#6481) and unnecessary I/O overhead (#6445).

Changes

Unified Exit Code Strategy

All cloud providers now follow a consistent two-step approach:

  1. Primary: Get exit code from the scheduler/cloud API
  2. Fallback: Read .exitcode file only when API returns null (not when it returns 0)
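The two steps above can be sketched as follows. This is an illustrative Python sketch rather than the Nextflow code in this PR; `resolve_exit_code` and the `read_exit_file` callback are hypothetical names standing in for each handler's API lookup and `.exitcode` read.

```python
from typing import Callable, Optional


def resolve_exit_code(
    api_exit_code: Optional[int],
    read_exit_file: Callable[[], Optional[int]],
) -> Optional[int]:
    """Return the task exit code, preferring the scheduler/cloud API value.

    `read_exit_file` is a hypothetical callback that reads the remote
    .exitcode file; it is only invoked when the API reported nothing.
    """
    # Compare against None explicitly: 0 is a valid (successful) exit code
    # and must not trigger the file fallback.
    if api_exit_code is not None:
        return api_exit_code
    # The API returned no exit code, so fall back to the .exitcode file.
    return read_exit_file()
```

The key point is the explicit null check: a truthiness test such as `if not api_exit_code` would treat 0 like a missing value and read the file for every successful task.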

Fixes by Provider

| Provider | Before | After |
|---|---|---|
| Google Batch | Read exit code only from .exitcode file | Get from API (taskExecution.exitCode), fallback to file if null |
| AWS Batch | API with fallback to file when exit code was 0 or null | API with fallback only when null |
| Azure Batch | API with fallback to file when exit code was 0 or null | API with fallback only when null |
| Kubernetes | API with fallback to file when exit code was 0 or null | API with fallback only when null |
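To illustrate the difference between the old and new fallback rule for AWS Batch, Azure Batch, and Kubernetes, here is a small Python sketch that counts how many `.exitcode` reads each rule would trigger. The function name, the `fallback_on_zero` flag, and the sample exit codes are made up for illustration and are not part of the PR.

```python
from typing import List, Optional


def count_file_reads(api_exit_codes: List[Optional[int]], fallback_on_zero: bool) -> int:
    """Count tasks that would need a .exitcode file read under a fallback rule.

    fallback_on_zero=True models the old rule (fallback on 0 or null);
    fallback_on_zero=False models the new rule (fallback only on null).
    """
    reads = 0
    for code in api_exit_codes:
        if code is None or (fallback_on_zero and code == 0):
            reads += 1
    return reads


# Hypothetical API results: three successes, one failure, one missing value.
codes = [0, 0, 0, 1, None]
before = count_file_reads(codes, fallback_on_zero=True)   # 4 reads
after = count_file_reads(codes, fallback_on_zero=False)   # 1 read
```

Under these assumed inputs, only the task whose exit code is missing from the API still needs a file read with the new rule.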

Benefits

  • Correct exit codes: Fusion exit codes (174/175) and other scheduler-reported failures are now properly captured from the Google Batch API
  • Reduced I/O: Eliminates unnecessary .exitcode file reads for successful tasks (exit code 0)
  • Better scalability: Particularly beneficial for workloads with many fine-grained jobs
  • Lower storage costs: Reduces remote file storage access (S3, Azure Blob, GCS)
  • Consistent behavior: All cloud executors now behave the same way

Closes

Test Coverage

Added comprehensive unit tests for all affected handlers verifying:

  • Exit code from API is used when available
  • Fallback to .exitcode file when API returns null
  • No fallback when API returns 0 (success)
  • Error exit codes (non-zero) are correctly captured
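The four cases listed above could be checked roughly as in the following Python sketch. It is not the PR's actual test code; `FakeExitFile`, `resolve_exit_code`, and the sample values are hypothetical stand-ins used only to show the shape of the checks, including that the file is not read when the API supplies a value.

```python
from typing import List, Optional


class FakeExitFile:
    """Hypothetical stand-in for the remote .exitcode file that counts reads."""

    def __init__(self, value: Optional[int]) -> None:
        self.value = value
        self.reads = 0

    def read(self) -> Optional[int]:
        self.reads += 1
        return self.value


def resolve_exit_code(api_exit_code: Optional[int], exit_file: FakeExitFile) -> Optional[int]:
    # Fall back to the file only when the API returned no exit code at all.
    if api_exit_code is not None:
        return api_exit_code
    return exit_file.read()


def run_cases() -> List[bool]:
    cases = [
        # (API exit code, file content, expected code, expected file reads)
        (0, 1, 0, 0),        # API returns 0: used as-is, no fallback
        (None, 137, 137, 1), # API returns null: fall back to the file
        (174, 0, 174, 0),    # non-zero code from the API is captured
        (1, 0, 1, 0),        # generic failure code from the API
    ]
    results = []
    for api_code, file_value, want_code, want_reads in cases:
        exit_file = FakeExitFile(file_value)
        got = resolve_exit_code(api_code, exit_file)
        results.append(got == want_code and exit_file.reads == want_reads)
    return results
```

Counting reads on the fake file makes the "no fallback on 0" case observable, not just the returned value.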

Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
@jorgee
Contributor Author

jorgee commented Oct 14, 2025

@bentsherman @pditommaso the Google Batch task handler currently reads the exit code directly from the file (#6481). This PR also removes the fallback to .exitcode when the exit code is 0. It is not a big PR, but I was wondering whether you would prefer to split it into two PRs to make backporting to stable versions easier. The fix for #6481 is just the first commit, ea1aa48.

@jorgee jorgee changed the title from "6445 optimize exit code handling by relying on scheduler status for successful executions" to "Optimize exit code handling by relying on scheduler status for successful executions" Oct 14, 2025
jorgee and others added 3 commits October 15, 2025 12:38
…g-on-scheduler-status-for-successful-executions
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
@jorgee jorgee marked this pull request as ready for review October 15, 2025 13:39
@bentsherman bentsherman added this to the 25.10 milestone Oct 16, 2025
Member

@pditommaso pditommaso left a comment

This looks good, but I'd keep it for after 25.10.

@bentsherman
Member

@jorgee we talked and agreed to include the Google Batch and Kubernetes fixes in 25.10 and to merge the rest of this PR in the next edge release.

@bentsherman bentsherman changed the milestone from 25.10 to 26.04 Oct 21, 2025
@pditommaso pditommaso merged commit 454a2ae into master Nov 28, 2025
25 checks passed
@pditommaso pditommaso deleted the 6445-optimize-exit-code-handling-by-relying-on-scheduler-status-for-successful-executions branch November 28, 2025 09:32
pditommaso added a commit that referenced this pull request Nov 28, 2025
Fix test method names introduced in PR #6484:
- deletePodIfSuccessful -> deleteJobIfSuccessful
- savePodLogOnError -> saveJobLogOnError

Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
Development

Successfully merging this pull request may close these issues.

Optimize exit code handling by relying on scheduler status for successful executions