[nvidia-ctk-installer] do not revert cri-o config on shutdown
This commit updates the behavior of the nvidia-ctk-installer for cri-o.
On shutdown, we no longer delete the drop-in config file as long as
none of the nvidia runtime handlers are set as the default runtime.
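The shutdown rule above can be sketched in Go as follows. This is a minimal illustration, not the installer's actual code: the function name `shouldRemoveDropIn` and the handler names in the example are hypothetical, and it assumes the drop-in file is still removed when an nvidia handler is the default runtime.

```go
package main

import "fmt"

// shouldRemoveDropIn is a hypothetical helper. It reports whether the
// cri-o drop-in config file should be deleted on shutdown: only when one
// of the nvidia runtime handlers is configured as the default runtime.
func shouldRemoveDropIn(defaultRuntime string, nvidiaHandlers []string) bool {
	for _, handler := range nvidiaHandlers {
		if handler == defaultRuntime {
			return true
		}
	}
	// No nvidia handler is the default, so the drop-in file is left in
	// place and cri-o can still resolve the handler for terminating pods.
	return false
}

func main() {
	// Illustrative handler names; the real list comes from the installer config.
	handlers := []string{"nvidia", "nvidia-cdi", "nvidia-legacy"}

	fmt.Println(shouldRemoveDropIn("crun", handlers))   // false: keep the drop-in file
	fmt.Println(shouldRemoveDropIn("nvidia", handlers)) // true: revert the config
}
```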
This change was made to work around an issue observed when uninstalling
the gpu-operator -- management containers launched with the nvidia
runtime handler would get stuck in the terminating state with the below
error message:
```
failed to find runtime handler nvidia from runtime list map[crun:... runc:...], failed to "KillPodSandbox" for ...
```
There appears to be a race condition where the nvidia-ctk-installer removes the drop-in file
and restarts cri-o. After the cri-o restart, if there are still pods / containers to terminate
that were started with the nvidia runtime, then cri-o fails to terminate them. The behavior
of cri-o, and its in-memory runtime handler cache, appears to differ from that of containerd as
we have never encountered such an issue with containerd.
This commit can be considered a stop-gap solution until a more robust solution is developed.
Signed-off-by: Christopher Desiniotis <cdesiniotis@nvidia.com>