Add note about missing user ids when starting containers (#282)
This seems to be a recurring issue that users bump into. Since it isn't
easy to guess what's wrong from the error message, I propose adding this
to the docs, so that the error message is at least searchable.
I've added it to the containers section for now, but it could also (or
instead) be added to the Slurm section. Thoughts?
---------
Co-authored-by: Rocco Meli <r.meli@bluemail.ch>
docs/software/container-engine/known-issue.md (29 additions, 0 deletions)
@@ -79,3 +79,32 @@ The use of `--environment` as `#SBATCH` is known to cause **unexpected behaviors
 - **Nested use of `--environment`**: running `srun --environment` in `#SBATCH --environment` results in double-entering EDF containers, causing unexpected errors in the underlying container runtime.
 
 To avoid any unexpected confusion, users are advised **not** to use `--environment` as `#SBATCH`. If users encounter a problem while using this, it's recommended to move `--environment` from `#SBATCH` to each `srun` and see if the problem disappears.
+
+[](){#ref-ce-no-user-id}
+## Container start fails with `id: cannot find name for user ID`
+
+If your Slurm job using a container fails to start with an error message similar to:
+
+```console
+slurmstepd: error: pyxis: container start failed with error code: 1
+slurmstepd: error: pyxis: container exited too soon
+slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
+slurmstepd: error: Failed to invoke spank plugin stack
+srun: error: nid001234: task 0: Exited with exit code 1
+srun: Terminating StepId=12345.0
+```
+
+it does not indicate an issue with your container, but instead means that one or more of the compute nodes have user databases that are not fully synchronized.
+If the problematic node is not automatically drained, please [let us know][ref-get-in-touch] so that we can ensure the node is in a good state.
+You can check the state of a node using `sinfo --nodes=<node>`, e.g.:
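The failure mode in the new docs section is a user-database lookup miss: the `id` utility cannot map the numeric UID owning the job to a user name on that node. As an illustration only (the function name and UID values below are arbitrary examples, not taken from the PR), the same lookup can be reproduced with Python's standard `pwd` module:

```python
import pwd


def lookup_user(uid: int):
    """Resolve a numeric UID to a user name via the node's user
    database, mirroring what `id` does. Returns None when the UID
    is unknown, which is the situation on a compute node whose
    user database is out of sync."""
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        return None


# UID 0 is "root" on essentially every Linux system.
print(lookup_user(0))

# A very high, unassigned UID fails to resolve and returns None,
# analogous to the `id: cannot find name for user ID` error.
print(lookup_user(2**31 - 2))
```

On a healthy node the job owner's UID resolves to a name; on a desynchronized node the same call fails even though the container image itself is fine.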
docs/software/container-engine/run.md (4 additions, 0 deletions)
@@ -24,6 +24,10 @@ There are three ways to do so:
 !!! note "Shared container at the node-level"
 
     For memory efficiency reasons, all Slurm tasks on an individual compute node share the same container, including its filesystem. As a consequence, any write operation to the container filesystem by one task will eventually become visible to all other tasks on the same node.
 
+!!! warning "Container start failure with `id: cannot find name for user ID`"
+    Containers may fail to start due to user database issues on compute nodes.
+    See [this section][ref-ce-no-user-id] for more details.
+
 ### Use from batch scripts
 
 Use `--environment` with the Slurm command (e.g., `srun` or `salloc`):
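The existing advice in known-issue.md (put `--environment` on each `srun` rather than on `#SBATCH`) can be sketched as a batch script. This is a hypothetical example: the environment name `my-env` and the executable `./my_app` are placeholders, not names from the docs.

```shell
#!/bin/bash
#SBATCH --job-name=ce-example
#SBATCH --nodes=1

# Note: --environment is deliberately NOT set via #SBATCH here.
# Attaching it to each srun avoids the double-entering of EDF
# containers described in the known-issues page.
srun --environment=my-env ./my_app
```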