Add initial RunContainer Task support #942
Conversation
Force-pushed from 4384a68 to a8f5dd5
...r/src/main/java/io/serverlessworkflow/impl/container/executors/StringExpressionResolver.java (outdated review comment, resolved)
fjtirado left a comment:
Requested changes are reflected in this PR treblereel#3
fjtirado left a comment:
Great work!
Before merging, I think @ricardozanini should review this container handling block: https://github.com/serverlessworkflow/sdk-java/pull/942/files#diff-124b1b11448a7a123f2b5b8b30d8a10568d3cdbb69aed81169350425ee4ea070R139-R199
There are some small comments. I was going to write one on every line, but I decided to pair my analysis with AI instead. Report generated with assistance:
High-impact issues (bugs / logic gaps)
- Timeout path returns success (0) instead of failing
In waitAccordingToLifetime, when policy == EVENTUALLY and awaitStatusCode(timeout) times out (or throws), you safeStop(id) and then fall through to the method’s final return 0;. That silently treats timeouts as success.
- Fix: detect the timeout and throw a clear TimeoutException (or your domain error) that includes the configured timeout and the container name/id. Only return 0 on an actual exit code of 0 (a sketch follows this list).
- policy may be null → NPE risk
If the workflow’s ContainerLifetime is absent, policy is never set. You later compare if (policy == ContainerCleanupPolicy.EVENTUALLY) ….
- Fix: set a default policy in the builder (e.g., IMMEDIATELY), or guard against null.
- Misleading exception messages (“exit code”) for non-exit errors
Catching InterruptedException / IOException and throwing “failed with exit code …” is confusing.
- Fix: use accurate messages: “interrupted while executing container …”, “I/O error while …”, etc. Keep exit code language only for real exit codes.
- Always pulling images = slow & noisy
pullImageIfNeeded always pulls. That’s expensive and brittle offline.
- Fix: call inspectImageCmd(image) first; only pull on NotFoundException. Consider a flag for “always pull” vs “pull if not present” (see the sketch after this list).
- Tags/digests handling
NameParser.parseRepositoryTag + defaulting to latest won’t handle digests (@sha256:…) correctly and may fetch unexpected tags.
- Fix: detect digests and call pullImageCmd(repo).withTag(tag) only when a tag exists; for digests use withRepository + withTag appropriately or avoid re-tagging entirely.
- Timeout result handling
awaitStatusCode(timeout) returns Integer (nullable). You don’t check null before using it.
- Fix: treat null as timeout and fail (don’t return 0).
- Auto-remove mismatch
Comment says “must be removed because of withAutoRemove(true)” but I don’t see you setting withAutoRemove(true) here (maybe a property setter does it). If it’s not guaranteed, you might orphan containers.
- Fix: ensure HostConfig.withAutoRemove(true) is always set for short-lived tasks, or make the cleanup policy explicit.
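To illustrate the timeout points above, here is a minimal sketch of what the wait could look like with docker-java, treating both an exception from awaitStatusCode and a null status code as a timeout instead of success. Names like waitForExit are illustrative, and safeStop is assumed to be the PR's existing stop/cleanup helper; this is not the PR's actual code.

```java
import java.time.Duration;
import java.util.concurrent.TimeUnit;
import com.github.dockerjava.api.DockerClient;
import com.github.dockerjava.api.command.WaitContainerResultCallback;
import com.github.dockerjava.api.exception.DockerClientException;

// Illustrative only; safeStop is assumed to be the PR's stop/cleanup helper.
int waitForExit(DockerClient client, String containerId, String containerName, Duration timeout) {
  Integer exitCode;
  try {
    exitCode = client.waitContainerCmd(containerId)
        .exec(new WaitContainerResultCallback())
        .awaitStatusCode(timeout.toMillis(), TimeUnit.MILLISECONDS);
  } catch (DockerClientException e) {
    // the wait did not complete in time (or failed); never report success here
    safeStop(containerId);
    throw new IllegalStateException("Container " + containerName + " (" + containerId
        + ") did not finish within " + timeout, e);
  }
  if (exitCode == null) {
    // a missing status code is a timeout, not success
    safeStop(containerId);
    throw new IllegalStateException("Container " + containerName + " (" + containerId
        + ") did not finish within " + timeout);
  }
  return exitCode; // 0 only when the container really exited with 0
}
```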
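And for the image-pull point, a sketch of the inspect-first-then-pull pattern, assuming docker-java's inspectImageCmd/pullImageCmd. The alwaysPull flag is a hypothetical configuration switch, and digest references would still need the separate handling mentioned above.

```java
import com.github.dockerjava.api.DockerClient;
import com.github.dockerjava.api.command.PullImageResultCallback;
import com.github.dockerjava.api.exception.NotFoundException;

// Illustrative only; alwaysPull would come from engine/task configuration.
void pullImageIfNeeded(DockerClient client, String image, boolean alwaysPull) throws InterruptedException {
  if (!alwaysPull) {
    try {
      client.inspectImageCmd(image).exec(); // already present locally, nothing to do
      return;
    } catch (NotFoundException e) {
      // not present, fall through and pull
    }
  }
  client.pullImageCmd(image)
      .exec(new PullImageResultCallback())
      .awaitCompletion();
}
```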
Robustness & operability
- Logs on failure
When a container exits non-zero, you throw a mapped error but don’t surface container logs—which makes debugging painful.
- Recommendation: on non-zero (or timeout), stream/capture the last N KB of stdout/stderr (logContainerCmd(id).withStdOut(true).withStdErr(true).withTail(n)), and include a trimmed snippet in the exception (and full logs in your engine’s diagnostics). A capture sketch follows this list.
- Resource limits & safety rails
No CPU/memory/FS caps → a user container can hog the host.
- Recommendation: support resource settings (CPU quota/period, cpus, memory/memorySwap, pidsLimit, read-only rootfs, drop capabilities, seccomp/apparmor profiles). Default to conservative limits; allow workflow overrides in a controlled/allow-listed way. A HostConfig sketch covering this, the networking defaults, and the labels below follows this list.
- This one we might have to adjust in the upstream specification.
- Image allow-list / registry auth
Consider an allow-list of images/prefixes and registry credentials. Otherwise, anyone can run :latest from anywhere.
- Recommendation: resolve images through a policy component; support private registry auth via AuthConfig.
- We might have to add a config to the engine core, such as sw.impl.containers.policy - @fjtirado
- Networking defaults
Decide your default network mode (none/bridge/custom). Workflows often don’t need internet access.
- Recommendation: default to a locked-down network, only open what the task needs.
- Graceful stop semantics
safeStop(id, 10s) is good, but for stubborn containers a kill fallback may be needed if stop doesn’t terminate.
- Recommendation: if the container is still running after .stop() + wait, send kill (SIGKILL) with a short deadline before remove (sketched after this list).
- Races around isRunning / remove
Between inspect and stop/remove the state may change; your try/catch(ignore) masks real issues, but that’s fine for cleanup. Consider logging at debug so post-mortems are possible.
- Singleton Docker client lifecycle
DockerClientHolder is static and never closed.
- Recommendation: register a shutdown hook / application lifecycle hook to close the HTTP client; or rely on process lifetime but document it.
- Threading & pool usage
CompletableFuture.supplyAsync(..., executorService()) is correct, but pulls/waits can be long. Ensure your executor is sized for this (don’t starve other workflow tasks). Consider a dedicated pool for container jobs.
- Return value & output contract
startSync returns the same input if exit is 0. If containers are expected to produce artifacts/logs/files, make that contract explicit (e.g., volumes or captured stdout). If stdout should become the task output, wire it.
- Port/volume validation
Your property setters likely map ports/volumes directly. Validate host paths (no / escape tricks), read-only when possible, and sanitize port ranges.
- Name & label hygiene
Add consistent labels (workflow id, task id, run id) and a predictable Name. This helps you find & clean orphans if the engine crashes.
e.g. labels: sw.workflow=, sw.instance=, sw.task=, sw.run=.
- Image “latest” pitfalls
Defaulting to latest is reproducibility-hostile.
- Recommendation: require explicit tags in production; optionally enforce via validation.
- Map more exit codes carefully
Your map is fine, but consider surfacing the exit status + signal from inspect to avoid mislabeling (e.g., 137 might be OOMKill vs SIGKILL). You can fetch inspectContainerCmd(id).exec().getState().getOOMKilled() and include that.
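For the logs-on-failure recommendation, a sketch of capturing the log tail with docker-java's logContainerCmd; tailLogs is an illustrative name and the tail size would come from configuration. The returned string could be attached to the non-zero-exit or timeout error.

```java
import java.util.ArrayList;
import java.util.List;
import com.github.dockerjava.api.DockerClient;
import com.github.dockerjava.api.async.ResultCallback;
import com.github.dockerjava.api.model.Frame;

// Illustrative only; call this on non-zero exit or timeout and include the result in the error.
String tailLogs(DockerClient client, String containerId, int tailLines) throws InterruptedException {
  List<String> lines = new ArrayList<>();
  client.logContainerCmd(containerId)
      .withStdOut(true)
      .withStdErr(true)
      .withTail(tailLines)
      .exec(new ResultCallback.Adapter<Frame>() {
        @Override
        public void onNext(Frame frame) {
          // each frame carries a chunk of stdout/stderr
          lines.add(new String(frame.getPayload()));
        }
      })
      .awaitCompletion();
  return String.join("", lines);
}
```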
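For the resource-limit, networking, and label items, a sketch of conservative container-creation defaults. The concrete limit values and label keys are illustrative assumptions, not something the PR or the spec defines; real values should be configurable and overridable per task.

```java
import java.util.Map;
import com.github.dockerjava.api.DockerClient;
import com.github.dockerjava.api.command.CreateContainerResponse;
import com.github.dockerjava.api.model.HostConfig;

// Illustrative defaults; limit values and label keys are placeholders.
CreateContainerResponse createLockedDown(DockerClient client, String image,
                                         String workflowId, String taskName) {
  HostConfig hostConfig = HostConfig.newHostConfig()
      .withMemory(256L * 1024 * 1024)  // 256 MiB
      .withCpuPeriod(100_000L)
      .withCpuQuota(50_000L)           // roughly half a CPU
      .withPidsLimit(128L)
      .withReadonlyRootfs(true)
      .withNetworkMode("none");        // locked down by default; open only what the task needs
  return client.createContainerCmd(image)
      .withHostConfig(hostConfig)
      .withLabels(Map.of(              // labels make orphaned containers easy to find and clean
          "sw.workflow", workflowId,
          "sw.task", taskName))
      .exec();
}
```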
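And for the graceful-stop item, a sketch of a stop-then-kill-then-remove fallback; the grace period should be read from configuration rather than hard-coded, as noted in the replies further down.

```java
import com.github.dockerjava.api.DockerClient;
import com.github.dockerjava.api.exception.NotFoundException;

// Illustrative only; stopSeconds would come from engine configuration.
void stopOrKill(DockerClient client, String containerId, int stopSeconds) {
  try {
    // stop sends SIGTERM and, after the grace period, the daemon escalates to SIGKILL
    client.stopContainerCmd(containerId).withTimeout(stopSeconds).exec();
    Boolean running = client.inspectContainerCmd(containerId).exec().getState().getRunning();
    if (Boolean.TRUE.equals(running)) {
      client.killContainerCmd(containerId).exec(); // stubborn container: kill explicitly
    }
    client.removeContainerCmd(containerId).withForce(true).exec();
  } catch (NotFoundException e) {
    // already removed (e.g. auto-remove kicked in); nothing left to clean up
  }
}
```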
I reviewed and changed every item in this report.
assertTrue(ports.containsKey(ExposedPort.tcp(8881)));
assertTrue(ports.containsKey(ExposedPort.tcp(8882)));

dockerClient.removeContainerCmd(containerId).withForce(true).exec();
Can you add this to a finally block?
The same applies to all tests; the procedure remains the same.
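A sketch of the requested shape; the container creation and the commented-out assertions stand in for whatever each individual test already does, with dockerClient coming from the test setup.

```java
// Illustrative test shape: cleanup always runs, even when an assertion fails.
String containerId = dockerClient.createContainerCmd("busybox").exec().getId();
try {
  dockerClient.startContainerCmd(containerId).exec();
  // ... assertions against the running container ...
} finally {
  dockerClient.removeContainerCmd(containerId).withForce(true).exec();
}
```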
- runContainer:
    run:
      container:
        image: alpine:latest
We can use busybox instead: https://hub.docker.com/_/busybox
It's less than 1MB.
Some comments on the AI report:
- Timeout result handling: AI is not perfect ;) I already grasped this one; the Integer (null or not) is passed to the upper methods, and startSync checks it for null.
- Threading & pool usage: hmmm, picky AI; the answer is no, we are not going to create specific executors for potentially long tasks.
- Return value & output contract: good point, which I believe we discussed briefly offline. I do not see it included here.
policy may be null → NPE risk: this is intentional; the == comparison is not going to cause an NPE.
Image allow-list / registry auth: I do not see exactly what the issue is, but in any case we already have a configuration object.
Graceful stop semantics: I do not think hardcoded timeouts are OK; this should be read from config. Also, stopping the container could probably be run asynchronously, but we can leave that to a follow-up PR; the first version is more than fine as is.
Image “latest” pitfalls: we are defaulting to latest only if the user has not set anything. I do not think the alternative, forcing the user to include the version in the image name or throwing an exception, is more desirable than the current approach. As we usually say, we should prioritize user experience.
Singleton Docker client lifecycle: this is an intentional design choice. The client is lazily loaded by the JVM the first time it is used and kept alive until the process ends (since the WorkflowApplication should be tied to the process lifecycle, this should be fine).
Auto-remove mismatch (“must be removed because of withAutoRemove(true)”): yes, a property setter sets it when the policy is ALWAYS or NEVER.
I agree, we can keep using
But the point is to set a default; I think it's fair game.
Force-pushed from 03d97ac to 0728b37
Force-pushed from 0728b37 to f8c70c6