Add initial RunContainer Task support #942
Conversation
Force-pushed from 4384a68 to a8f5dd5
...r/src/main/java/io/serverlessworkflow/impl/container/executors/StringExpressionResolver.java (outdated review comment, resolved)
fjtirado left a comment:
Requested changes are reflected in this PR treblereel#3
fjtirado left a comment:
Great work!
Before merging, I think @ricardozanini should review this container handling block: https://github.com/serverlessworkflow/sdk-java/pull/942/files#diff-124b1b11448a7a123f2b5b8b30d8a10568d3cdbb69aed81169350425ee4ea070R139-R199
There are some small comments. I was going to write one on every line, but I decided to pair my analysis with AI instead. Report generated with assistance:
High-impact issues (bugs / logic gaps)
- Timeout path returns success (0) instead of failing
In waitAccordingToLifetime, when policy == EVENTUALLY and awaitStatusCode(timeout) times out (or throws), you safeStop(id) and then fall through to the method’s final return 0;. That silently treats timeouts as success.
- Fix: detect the timeout and throw a clear TimeoutException (or your domain error) that includes the configured timeout and the container name/id. Only return 0 on an actual exit code of 0 (a sketch follows this list).
- policy may be null → NPE risk
If the workflow’s ContainerLifetime is absent, policy is never set. You later compare if (policy == ContainerCleanupPolicy.EVENTUALLY) ….
- Fix: set a default policy in the builder (e.g., IMMEDIATELY), or guard against null.
- Misleading exception messages (“exit code”) for non-exit errors
Catching InterruptedException / IOException and throwing “failed with exit code …” is confusing.
- Fix: use accurate messages: “interrupted while executing container …”, “I/O error while …”, etc. Keep exit code language only for real exit codes.
- Always pulling images = slow & noisy
pullImageIfNeeded always pulls. That’s expensive and brittle offline.
- Fix: call inspectImageCmd(image) first; only pull on NotFoundException. Consider a flag for “always pull” vs “pull if not present” (see the sketch after this list).
- Tags/digests handling
NameParser.parseRepositoryTag + defaulting to latest won’t handle digests (@sha256:…) correctly and may fetch unexpected tags.
- Fix: detect digests and call pullImageCmd(repo).withTag(tag) only when a tag exists; for digests use withRepository + withTag appropriately or avoid re-tagging entirely.
- Timeout result handling
awaitStatusCode(timeout) returns Integer (nullable). You don’t check null before using it.
- Fix: treat null as timeout and fail (don’t return 0).
- Auto-remove mismatch
Comment says “must be removed because of withAutoRemove(true)” but I don’t see you setting withAutoRemove(true) here (maybe a property setter does it). If it’s not guaranteed, you might orphan containers.
- Fix: ensure HostConfig.withAutoRemove(true) is always set for short-lived tasks, or make the cleanup policy explicit.
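To illustrate the timeout points above, here is a minimal sketch of what the wait could look like with docker-java, treating both an exception from awaitStatusCode and a null status code as a timeout instead of success. Names like waitForExit are illustrative, and safeStop is assumed to be the PR's existing stop/cleanup helper; this is not the PR's actual code.

```java
import java.time.Duration;
import java.util.concurrent.TimeUnit;
import com.github.dockerjava.api.DockerClient;
import com.github.dockerjava.api.command.WaitContainerResultCallback;
import com.github.dockerjava.api.exception.DockerClientException;

// Illustrative only; safeStop is assumed to be the PR's stop/cleanup helper.
int waitForExit(DockerClient client, String containerId, String containerName, Duration timeout) {
  Integer exitCode;
  try {
    exitCode = client.waitContainerCmd(containerId)
        .exec(new WaitContainerResultCallback())
        .awaitStatusCode(timeout.toMillis(), TimeUnit.MILLISECONDS);
  } catch (DockerClientException e) {
    // the wait did not complete in time (or failed); never report success here
    safeStop(containerId);
    throw new IllegalStateException("Container " + containerName + " (" + containerId
        + ") did not finish within " + timeout, e);
  }
  if (exitCode == null) {
    // a missing status code is a timeout, not success
    safeStop(containerId);
    throw new IllegalStateException("Container " + containerName + " (" + containerId
        + ") did not finish within " + timeout);
  }
  return exitCode; // 0 only when the container really exited with 0
}
```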
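And for the image-pull point, a sketch of the inspect-first-then-pull pattern, assuming docker-java's inspectImageCmd/pullImageCmd. The alwaysPull flag is a hypothetical configuration switch, and digest references would still need the separate handling mentioned above.

```java
import com.github.dockerjava.api.DockerClient;
import com.github.dockerjava.api.command.PullImageResultCallback;
import com.github.dockerjava.api.exception.NotFoundException;

// Illustrative only; alwaysPull would come from engine/task configuration.
void pullImageIfNeeded(DockerClient client, String image, boolean alwaysPull) throws InterruptedException {
  if (!alwaysPull) {
    try {
      client.inspectImageCmd(image).exec(); // already present locally, nothing to do
      return;
    } catch (NotFoundException e) {
      // not present, fall through and pull
    }
  }
  client.pullImageCmd(image)
      .exec(new PullImageResultCallback())
      .awaitCompletion();
}
```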
Robustness & operability
- Logs on failure
When a container exits non-zero, you throw a mapped error but don’t surface container logs—which makes debugging painful.
- Recommendation: on non-zero (or timeout), stream/capture the last N KB of stdout/stderr (logContainerCmd(id).withStdOut(true).withStdErr(true).withTail(n)), and include a trimmed snippet in the exception (and full logs in your engine’s diagnostics). A capture sketch follows this list.
- Resource limits & safety rails
No CPU/memory/FS caps → a user container can hog the host.
- Recommendation: support resource settings (CPU quota/period, cpus, memory/memorySwap, pidsLimit, read-only rootfs, drop capabilities, seccomp/apparmor profiles). Default to conservative limits; allow workflow overrides in a controlled/allow-listed way. A HostConfig sketch covering this, the networking defaults, and the labels below follows this list.
- This one we might have to adjust in the upstream specification.
- Image allow-list / registry auth
Consider an allow-list of images/prefixes and registry credentials. Otherwise, anyone can run :latest from anywhere.
- Recommendation: resolve images through a policy component; support private registry auth via AuthConfig.
- We might have to add a config to the engine core, such as sw.impl.containers.policy - @fjtirado
- Networking defaults
Decide your default network mode (none/bridge/custom). Workflows often don’t need internet access.
- Recommendation: default to a locked-down network, only open what the task needs.
- Graceful stop semantics
safeStop(id, 10s) is good, but for stubborn containers a kill fallback may be needed if stop doesn’t terminate.
- Recommendation: if the container is still running after .stop() + wait, send kill (SIGKILL) with a short deadline before remove (sketched after this list).
- Races around isRunning / remove
Between inspect and stop/remove the state may change; your try/catch(ignore) masks real issues, but that’s fine for cleanup. Consider logging at debug so post-mortems are possible.
- Singleton Docker client lifecycle
DockerClientHolder is static and never closed.
- Recommendation: register a shutdown hook / application lifecycle hook to close the HTTP client; or rely on process lifetime but document it.
- Threading & pool usage
CompletableFuture.supplyAsync(..., executorService()) is correct, but pulls/waits can be long. Ensure your executor is sized for this (don’t starve other workflow tasks). Consider a dedicated pool for container jobs.
- Return value & output contract
startSync returns the same input if exit is 0. If containers are expected to produce artifacts/logs/files, make that contract explicit (e.g., volumes or captured stdout). If stdout should become the task output, wire it.
- Port/volume validation
Your property setters likely map ports/volumes directly. Validate host paths (no / escape tricks), read-only when possible, and sanitize port ranges.
- Name & label hygiene
Add consistent labels (workflow id, task id, run id) and a predictable Name. This helps you find & clean orphans if the engine crashes.
e.g. labels: sw.workflow=, sw.instance=, sw.task=, sw.run=.
- Image “latest” pitfalls
Defaulting to latest is reproducibility-hostile.
- Recommendation: require explicit tags in production; optionally enforce via validation.
- Map more exit codes carefully
Your map is fine, but consider surfacing the exit status + signal from inspect to avoid mislabeling (e.g., 137 might be OOMKill vs SIGKILL). You can fetch inspectContainerCmd(id).exec().getState().getOOMKilled() and include that.
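For the logs-on-failure recommendation, a sketch of capturing the log tail with docker-java's logContainerCmd; tailLogs is an illustrative name and the tail size would come from configuration. The returned string could be attached to the non-zero-exit or timeout error.

```java
import java.util.ArrayList;
import java.util.List;
import com.github.dockerjava.api.DockerClient;
import com.github.dockerjava.api.async.ResultCallback;
import com.github.dockerjava.api.model.Frame;

// Illustrative only; call this on non-zero exit or timeout and include the result in the error.
String tailLogs(DockerClient client, String containerId, int tailLines) throws InterruptedException {
  List<String> lines = new ArrayList<>();
  client.logContainerCmd(containerId)
      .withStdOut(true)
      .withStdErr(true)
      .withTail(tailLines)
      .exec(new ResultCallback.Adapter<Frame>() {
        @Override
        public void onNext(Frame frame) {
          // each frame carries a chunk of stdout/stderr
          lines.add(new String(frame.getPayload()));
        }
      })
      .awaitCompletion();
  return String.join("", lines);
}
```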
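For the resource-limit, networking, and label items, a sketch of conservative container-creation defaults. The concrete limit values and label keys are illustrative assumptions, not something the PR or the spec defines; real values should be configurable and overridable per task.

```java
import java.util.Map;
import com.github.dockerjava.api.DockerClient;
import com.github.dockerjava.api.command.CreateContainerResponse;
import com.github.dockerjava.api.model.HostConfig;

// Illustrative defaults; limit values and label keys are placeholders.
CreateContainerResponse createLockedDown(DockerClient client, String image,
                                         String workflowId, String taskName) {
  HostConfig hostConfig = HostConfig.newHostConfig()
      .withMemory(256L * 1024 * 1024)  // 256 MiB
      .withCpuPeriod(100_000L)
      .withCpuQuota(50_000L)           // roughly half a CPU
      .withPidsLimit(128L)
      .withReadonlyRootfs(true)
      .withNetworkMode("none");        // locked down by default; open only what the task needs
  return client.createContainerCmd(image)
      .withHostConfig(hostConfig)
      .withLabels(Map.of(              // labels make orphaned containers easy to find and clean
          "sw.workflow", workflowId,
          "sw.task", taskName))
      .exec();
}
```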
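And for the graceful-stop item, a sketch of a stop-then-kill-then-remove fallback; the grace period should be read from configuration rather than hard-coded, as noted in the replies further down.

```java
import com.github.dockerjava.api.DockerClient;
import com.github.dockerjava.api.exception.NotFoundException;

// Illustrative only; stopSeconds would come from engine configuration.
void stopOrKill(DockerClient client, String containerId, int stopSeconds) {
  try {
    // stop sends SIGTERM and, after the grace period, the daemon escalates to SIGKILL
    client.stopContainerCmd(containerId).withTimeout(stopSeconds).exec();
    Boolean running = client.inspectContainerCmd(containerId).exec().getState().getRunning();
    if (Boolean.TRUE.equals(running)) {
      client.killContainerCmd(containerId).exec(); // stubborn container: kill explicitly
    }
    client.removeContainerCmd(containerId).withForce(true).exec();
  } catch (NotFoundException e) {
    // already removed (e.g. auto-remove kicked in); nothing left to clean up
  }
}
```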
I reviewed and changed every item in this report.
assertTrue(ports.containsKey(ExposedPort.tcp(8881)));
assertTrue(ports.containsKey(ExposedPort.tcp(8882)));

dockerClient.removeContainerCmd(containerId).withForce(true).exec();
Can you add this to a finally block?
The same applies to all tests; the procedure remains the same.
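A sketch of the requested shape; the container creation and the commented-out assertions stand in for whatever each individual test already does, with dockerClient coming from the test setup.

```java
// Illustrative test shape: cleanup always runs, even when an assertion fails.
String containerId = dockerClient.createContainerCmd("busybox").exec().getId();
try {
  dockerClient.startContainerCmd(containerId).exec();
  // ... assertions against the running container ...
} finally {
  dockerClient.removeContainerCmd(containerId).withForce(true).exec();
}
```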
- runContainer:
    run:
      container:
        image: alpine:latest
We can use busybox instead: https://hub.docker.com/_/busybox
It's less than 1MB.
Some comments on the AI report:
- Timeout result handling: AI is not perfect ;) I already grasped this one; the Integer (null or not) is passed to the upper methods, and startSync checks it for null.
- Threading & pool usage: hmmm, picky AI; the answer is no, we are not going to create specific executors for potentially long tasks.
- Return value & output contract: good point, which I believe we discussed briefly offline. I do not see it included here.
policy may be null → NPE risk: this is intentional; the == comparison is not going to cause an NPE.
Image allow-list / registry auth: I do not see exactly what the issue is, but in any case we already have a configuration object.
Graceful stop semantics: I do not think hardcoded timeouts are OK; this should be read from config. Also, stopping the container could probably be run asynchronously, but we can leave that to a follow-up PR; the first version is more than fine as is.
Image “latest” pitfalls: we are defaulting to latest only if the user has not set anything. I do not think the alternative, forcing the user to include the version in the image name or throwing an exception, is more desirable than the current approach. As we usually say, we should prioritize user experience.
Singleton Docker client lifecycle: this is an intentional design choice. The client is lazily loaded by the JVM the first time it is used and kept alive until the process ends (since the WorkflowApplication should be tied to the process lifecycle, this should be fine).
Auto-remove mismatch (“must be removed because of withAutoRemove(true)”): yes, a property setter sets it when the policy is ALWAYS or NEVER.
I agree, we can keep using
But the point is to set a default; I think it's fair game.
Force-pushed from 03d97ac to 0728b37
Force-pushed from 0728b37 to f8c70c6