Add note about missing user ids when starting containers (#282)
This seems to be a recurring issue that users bump into. Since it isn't
easy to guess what's wrong from the error message, I propose adding this
to the docs, so that the error message is at least searchable.
I've added it to the containers section for now, but it could also (or
instead) be added to the Slurm section. Thoughts?
---------
Co-authored-by: Rocco Meli <r.meli@bluemail.ch>
docs/software/container-engine/known-issue.md (29 additions, 0 deletions)
@@ -79,3 +79,32 @@ The use of `--environment` as `#SBATCH` is known to cause **unexpected behaviors
 - **Nested use of `--environment`**: running `srun --environment` in `#SBATCH --environment` results in double-entering EDF containers, causing unexpected errors in the underlying container runtime.
 
 To avoid any unexpected confusion, users are advised **not** to use `--environment` as `#SBATCH`. If users encounter a problem while using this, it's recommended to move `--environment` from `#SBATCH` to each `srun` and see if the problem disappears.
+
+[](){#ref-ce-no-user-id}
+## Container start fails with `id: cannot find name for user ID`
+
+If your Slurm job using a container fails to start with an error message similar to:
+
+```console
+slurmstepd: error: pyxis: container start failed with error code: 1
+slurmstepd: error: pyxis: container exited too soon
+slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
+slurmstepd: error: Failed to invoke spank plugin stack
+srun: error: nid001234: task 0: Exited with exit code 1
+srun: Terminating StepId=12345.0
+```
+
+it does not indicate an issue with your container, but instead means that one or more of the compute nodes have user databases that are not fully synchronized.
+If the problematic node is not automatically drained, please [let us know][ref-get-in-touch] so that we can ensure the node is in a good state.
+You can check the state of a node using `sinfo --nodes=<node>`, e.g.:
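The failure mode in the new docs section is a user-database lookup miss: the `id` utility cannot map the numeric UID owning the job to a user name on that node. As an illustration only (the function name and UID values below are arbitrary examples, not taken from the PR), the same lookup can be reproduced with Python's standard `pwd` module:

```python
import pwd


def lookup_user(uid: int):
    """Resolve a numeric UID to a user name via the node's user
    database, mirroring what `id` does. Returns None when the UID
    is unknown, which is the situation on a compute node whose
    user database is out of sync."""
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        return None


# UID 0 is "root" on essentially every Linux system.
print(lookup_user(0))

# A very high, unassigned UID fails to resolve and returns None,
# analogous to the `id: cannot find name for user ID` error.
print(lookup_user(2**31 - 2))
```

On a healthy node the job owner's UID resolves to a name; on a desynchronized node the same call fails even though the container image itself is fine.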
docs/software/container-engine/run.md (4 additions, 0 deletions)
@@ -24,6 +24,10 @@ There are three ways to do so:
 !!! note "Shared container at the node-level"
 
     For memory efficiency reasons, all Slurm tasks on an individual compute node share the same container, including its filesystem. As a consequence, any write operation to the container filesystem by one task will eventually become visible to all other tasks on the same node.
 
+!!! warning "Container start failure with `id: cannot find name for user ID`"
+    Containers may fail to start due to user database issues on compute nodes.
+    See [this section][ref-ce-no-user-id] for more details.
+
 ### Use from batch scripts
 
 Use `--environment` with the Slurm command (e.g., `srun` or `salloc`):
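The existing advice in known-issue.md (put `--environment` on each `srun` rather than on `#SBATCH`) can be sketched as a batch script. This is a hypothetical example: the environment name `my-env` and the executable `./my_app` are placeholders, not names from the docs.

```shell
#!/bin/bash
#SBATCH --job-name=ce-example
#SBATCH --nodes=1

# Note: --environment is deliberately NOT set via #SBATCH here.
# Attaching it to each srun avoids the double-entering of EDF
# containers described in the known-issues page.
srun --environment=my-env ./my_app
```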