@elezar
Member

@elezar elezar commented Oct 31, 2025

This change ensures that directories that are already included in the ld.conf are not added to the NVCR-specific config files. The intent is to avoid inadvertently promoting existing directories to a higher priority.

See #123 where this was already an issue in the libnvidia-container-based implementation.

Testing

Build the following Docker image:

docker build -t libordering \
            - <<EOF
FROM ubuntu
ENV NVIDIA_VISIBLE_DEVICES=all
RUN mkdir -p /extra/lib
RUN cp /usr/lib/$(uname -m)-linux-gnu/libc.so.? /extra/lib/
RUN echo "/extra/lib" > /etc/ld.so.conf.d/00-xxx.conf
RUN ldconfig
EOF

Without the nvidia runtime:

$ docker run --rm -ti --runtime=runc libordering bash -c "ldconfig -p | grep libc.so."
        libc.so.6 (libc6,AArch64) => /extra/lib/libc.so.6
        libc.so.6 (libc6,AArch64) => /lib/aarch64-linux-gnu/libc.so.6

Without this change:

$ docker run --rm -ti --runtime=nvidia libordering bash -c "ldconfig -p | grep libc.so."
        libc.so.6 (libc6,AArch64) => /usr/lib/aarch64-linux-gnu/libc.so.6
        libc.so.6 (libc6,AArch64) => /extra/lib/libc.so.6

(Note that the libc.so.6 in /usr/lib/aarch64-linux-gnu takes precedence).

With this change:

$ docker run --rm -ti --runtime=nvidia libordering bash -c "ldconfig -p | grep libc.so."
        libc.so.6 (libc6,AArch64) => /extra/lib/libc.so.6
        libc.so.6 (libc6,AArch64) => /lib/aarch64-linux-gnu/libc.so.6

@elezar
Member Author

elezar commented Oct 31, 2025

/cherry-pick release-1.18

@elezar elezar added this to the v1.18.1 milestone Oct 31, 2025
@elezar elezar self-assigned this Oct 31, 2025
@elezar elezar force-pushed the selective-ldcache-config branch from 0b4a83b to d0a1221 Compare October 31, 2025 15:25
Collaborator

@ArangoGutierrez ArangoGutierrez left a comment

LGTM, with 2 non-blocking comments.

return SafeExec(ldconfigPath, args, nil)
}

func (l *Ldconfig) filterDirectories(configFilePath string, directories ...string) ([]string, error) {
Collaborator

Question: If the function doesn't need to access or modify the state of an Ldconfig object, why not make it a regular standalone function?

Member Author

I was originally accessing a member, but I can remove it.

@elezar elezar force-pushed the selective-ldcache-config branch 2 times, most recently from 2585da9 to 5b1d918 Compare November 5, 2025 14:32
@ArangoGutierrez ArangoGutierrez self-requested a review November 5, 2025 15:27
ArangoGutierrez
ArangoGutierrez previously approved these changes Nov 5, 2025
// Explicitly specify using /etc/ld.so.conf since the host's ldconfig may
// be configured to use a different config file by default.
configFilePath := "/etc/ld.so.conf"
filteredDirectories, err := filterDirectories(configFilePath, directories...)

(Jason from Uber AI Infra here - we filed the ticket) What will be passed in as directories?

Member Author

directories contains the container paths of the parent directories of any CUDA user-mode driver libraries that are mounted from the host. Note that at present the container paths of the libraries are the same as the host paths.

On an Ubuntu-based system the list is typically:

/usr/lib/${ARCH_SPECIFIC_LIB_DIR}/
/usr/lib/${ARCH_SPECIFIC_LIB_DIR}/vdpau

which results in the following CDI hook:

        - hookName: createContainer
          path: /usr/bin/nvidia-cdi-hook
          args:
            - nvidia-cdi-hook
            - update-ldcache
            - --folder
            - /lib/x86_64-linux-gnu
            - --folder
            - /lib/x86_64-linux-gnu/vdpau
          env:
            - NVIDIA_CTK_DEBUG=false

To see what the list is on your system, you can run the nvidia-ctk cdi generate command and check the output for the generated update-ldcache hook.

@elezar
Member Author

elezar commented Nov 6, 2025

@jasonzlai do you have a simple reproducer Dockerfile that we could add to our test suite to address this?

@elezar elezar force-pushed the selective-ldcache-config branch from 5b1d918 to 1c7487a Compare November 11, 2025 10:16
Signed-off-by: Evan Lezar <elezar@nvidia.com>
@elezar elezar force-pushed the selective-ldcache-config branch 2 times, most recently from 25fc385 to abf739b Compare November 12, 2025 09:23
This change fixes the behavior where the order of precedence
of existing folders is changed because they are added to the
.conf file for ldconfig.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
@elezar elezar force-pushed the selective-ldcache-config branch from abf739b to 1bd5242 Compare November 12, 2025 09:46
@elezar elezar dismissed ArangoGutierrez’s stale review November 12, 2025 10:35

Additional changes.

Collaborator

@ArangoGutierrez ArangoGutierrez left a comment

lgtm

@elezar elezar merged commit a7a8df2 into NVIDIA:main Nov 12, 2025
13 checks passed
@elezar elezar deleted the selective-ldcache-config branch November 12, 2025 20:43
@github-actions

🤖 Backport PR created for release-1.18: #1449
