Open
Labels: lifecycle/stale (denotes an issue or PR that has remained open with no activity and has become stale)
Description
Setup:
With a custom change to the GPU Operator:
NVIDIA/gpu-operator@master...Dragoncell:gpu-operator:master-gke
I used the command below to install the GPU Operator with CDI enabled, using the COS-installed GPU driver:
helm upgrade -i --create-namespace --namespace gpu-operator noperator deployments/gpu-operator \
  --set driver.enabled=false \
  --set cdi.enabled=true \
  --set cdi.default=true \
  --set operator.runtimeClass=nvidia-cdi \
  --set hostRoot=/ \
  --set driverRoot=/home/kubernetes/bin/nvidia \
  --set devRoot=/ \
  --set operator.repository=gcr.io/jiamingxu-gke-dev \
  --set operator.version=v0422_04 \
  --set toolkit.installDir=/home/kubernetes/bin/nvidia \
  --set toolkit.repository=gcr.io/jiamingxu-gke-dev \
  --set toolkit.version=v4 \
  --set validator.repository=gcr.io/jiamingxu-gke-dev \
  --set validator.version=v0417_1 \
  --set devicePlugin.version=v0422_4 \
  --set devicePlugin.repository=gcr.io/jiamingxu-gke-dev
During CDI spec generation, both in the toolkit container (for the management CDI spec) and in the k8s device plugin (for the workload CDI spec), a number of warning-level messages are logged.
In both components:
- Could not find ld.so.cache
time="2024-04-22T19:37:03Z" level=warning msg="Could not find ld.so.cache at /host/home/kubernetes/bin/nvidia/etc/ld.so.cache; creating empty cache"
time="2024-04-22T19:37:03Z" level=info msg="Using driver version 535.129.03"
time="2024-04-22T19:37:03Z" level=warning msg="Could not find ld.so.cache at /host/home/kubernetes/bin/nvidia/etc/ld.so.cache; creating empty cache"
- Feature-related paths (persistenced, fabric manager, MPS, GSP firmware)
time="2024-04-22T19:37:03Z" level=warning msg="Could not locate /nvidia-persistenced/socket: pattern /nvidia-persistenced/socket not found"
time="2024-04-22T19:37:03Z" level=warning msg="Could not locate /nvidia-fabricmanager/socket: pattern /nvidia-fabricmanager/socket not found"
time="2024-04-22T19:37:03Z" level=warning msg="Could not locate /tmp/nvidia-mps: pattern /tmp/nvidia-mps not found"
time="2024-04-22T19:37:03Z" level=warning msg="Could not locate nvidia/535.129.03/gsp*.bin: pattern nvidia/535.129.03/gsp*.bin not found"
In the k8s device plugin only:
time="2024-04-22T19:37:22Z" level=warning msg="Could not locate glvnd/egl_vendor.d/10_nvidia.json: pattern glvnd/egl_vendor.d/10_nvidia.json not found"
time="2024-04-22T19:37:22Z" level=warning msg="Could not locate vulkan/icd.d/nvidia_icd.json: pattern vulkan/icd.d/nvidia_icd.json not found"
time="2024-04-22T19:37:22Z" level=warning msg="Could not locate vulkan/icd.d/nvidia_layers.json: pattern vulkan/icd.d/nvidia_layers.json not found"
time="2024-04-22T19:37:22Z" level=warning msg="Could not locate vulkan/implicit_layer.d/nvidia_layers.json: pattern vulkan/implicit_layer.d/nvidia_layers.json not found"
time="2024-04-22T19:37:22Z" level=warning msg="Could not locate egl/egl_external_platform.d/15_nvidia_gbm.json: pattern egl/egl_external_platform.d/15_nvidia_gbm.json not found"
time="2024-04-22T19:37:22Z" level=warning msg="Could not locate egl/egl_external_platform.d/10_nvidia_wayland.json: pattern egl/egl_external_platform.d/10_nvidia_wayland.json not found"
time="2024-04-22T19:37:22Z" level=warning msg="Could not locate nvidia/nvoptix.bin: pattern nvidia/nvoptix.bin not found"
time="2024-04-22T19:37:22Z" level=warning msg="Could not locate nvidia/xorg/nvidia_drv.so: pattern nvidia/xorg/nvidia_drv.so not found"
time="2024-04-22T19:37:22Z" level=warning msg="Could not locate nvidia/xorg/libglxserver_nvidia.so.535.129.03: pattern nvidia/xorg/libglxserver_nvidia.so.535.129.03 not found"
time="2024-04-22T19:37:22Z" level=warning msg="Could not locate X11/xorg.conf.d/10-nvidia.conf: pattern X11/xorg.conf.d/10-nvidia.conf not found"
....
time="2024-04-22T19:37:22Z" level=warning msg="Could not locate nvidia/xorg/nvidia_drv.so: pattern nvidia/xorg/nvidia_drv.so not found"
time="2024-04-22T19:37:22Z" level=warning msg="Could not locate nvidia/xorg/libglxserver_nvidia.so.535.129.03: pattern nvidia/xorg/libglxserver_nvidia.so.535.129.03 not found"
Are any of these warnings worth further investigation? For example, vulkan/icd.d/nvidia_icd.json does exist on the host, under:
/home/kubernetes/bin/nvidia/vulkan/icd.d $ ls
nvidia_icd.json
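To separate warnings about files that are genuinely absent from warnings about files that exist somewhere under the driver root (like nvidia_icd.json above), a quick check like the following could be run on the node. This is a hypothetical diagnostic sketch, not part of the GPU Operator or the NVIDIA Container Toolkit; `check_pattern` and `DRIVER_ROOT` are illustrative names, and it makes no assumption about which subdirectories the toolkit actually searches. It only looks under `DRIVER_ROOT`, so anything expected outside that tree will always show as missing.

```shell
#!/usr/bin/env bash
# Hypothetical diagnostic (not part of the GPU Operator or the NVIDIA Container
# Toolkit): for each path pattern named in a "Could not locate ..." warning,
# report whether any file under DRIVER_ROOT ends with that pattern.

# Defaults to the driverRoot passed to helm; override with DRIVER_ROOT=...
DRIVER_ROOT="${DRIVER_ROOT:-/home/kubernetes/bin/nvidia}"

check_pattern() {
  local root="$1"
  local pattern="${2#/}"   # drop a leading '/' so "*/<pattern>" still matches
  local matches
  # In find's -path test, '*' also matches '/', so "*/<pattern>" matches the
  # pattern at any depth under root; globs such as gsp*.bin still apply.
  matches=$(find "$root" -path "*/${pattern}" 2>/dev/null | sort || true)
  if [ -n "$matches" ]; then
    # Join multiple matches onto one line.
    echo "FOUND   ${pattern}: ${matches//$'\n'/ }"
  else
    echo "MISSING ${pattern}"
  fi
}

# Patterns copied from the warnings in this issue.
for p in \
  "etc/ld.so.cache" \
  "/nvidia-persistenced/socket" \
  "/nvidia-fabricmanager/socket" \
  "/tmp/nvidia-mps" \
  "nvidia/535.129.03/gsp*.bin" \
  "glvnd/egl_vendor.d/10_nvidia.json" \
  "vulkan/icd.d/nvidia_icd.json" \
  "vulkan/icd.d/nvidia_layers.json" \
  "vulkan/implicit_layer.d/nvidia_layers.json" \
  "egl/egl_external_platform.d/15_nvidia_gbm.json" \
  "egl/egl_external_platform.d/10_nvidia_wayland.json" \
  "nvidia/nvoptix.bin" \
  "nvidia/xorg/nvidia_drv.so" \
  "nvidia/xorg/libglxserver_nvidia.so.535.129.03" \
  "X11/xorg.conf.d/10-nvidia.conf"
do
  check_pattern "$DRIVER_ROOT" "$p"
done
```

Given the `ls` output above, vulkan/icd.d/nvidia_icd.json would be expected to show as FOUND here, which would suggest that this particular warning comes from where the lookup searches rather than from the file being missing.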