
Commit 8cddd20

Merge pull request #97359 from StephenJamesSmith/TELCODOCS-2140
TELCODOCS-2140
2 parents 737f89f + 1ecbac5 commit 8cddd20

10 files changed: +1213 −0 lines changed

_topic_maps/_topic_map.yml

Lines changed: 2 additions & 0 deletions
@@ -3576,6 +3576,8 @@ Topics:
   File: gaudi-ai-accelerator
 - Name: Remote Direct Memory Access (RDMA)
   File: rdma-remote-direct-memory-access
+- Name: Dynamic Accelerator Slicer (DAS) Operator
+  File: das-about-dynamic-accelerator-slicer-operator
 ---
 Name: Backup and restore
 Dir: backup_and_restore
Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@
:_mod-docs-content-type: ASSEMBLY
[id="das-about-dynamic-accelerator-slicer-operator"]
= Dynamic Accelerator Slicer (DAS) Operator
include::_attributes/common-attributes.adoc[]
:context: das-about-dynamic-accelerator-slicer-operator

toc::[]

:FeatureName: Dynamic Accelerator Slicer Operator

include::snippets/technology-preview.adoc[]

The Dynamic Accelerator Slicer (DAS) Operator allows you to dynamically slice GPU accelerators in {product-title}, instead of relying on statically sliced GPUs that are defined when the node boots. You can slice GPUs on demand based on specific workload requirements, which ensures efficient resource utilization.

Dynamic slicing is useful if you do not know in advance all the accelerator partitions that are needed on every node in the cluster.

The DAS Operator currently includes a reference implementation for NVIDIA Multi-Instance GPU (MIG) and is designed to support additional technologies, such as NVIDIA MPS or GPUs from other vendors, in the future.

.Limitations

The following limitations apply when using the Dynamic Accelerator Slicer Operator:

* You must identify potential incompatibilities and ensure that the system works with your GPU drivers and operating systems.

* The Operator works only with specific MIG-compatible NVIDIA GPUs and drivers, such as the H100 and A100.

* The Operator cannot use only a subset of the GPUs on a node.

* The NVIDIA device plugin cannot be used together with the Dynamic Accelerator Slicer Operator to manage the GPU resources of a cluster.

[NOTE]
====
The DAS Operator is designed to work with MIG-enabled GPUs. It allocates MIG slices instead of whole GPUs. Installing the DAS Operator prevents the use of the standard resource request provided by the NVIDIA device plugin, such as `nvidia.com/gpu: "1"`, for allocating an entire GPU.
====
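
For example, a workload that previously requested an entire GPU through the NVIDIA device plugin instead requests a MIG slice resource. The following snippet is a minimal illustration that uses the `nvidia.com/mig-1g.5gb` resource shown later in this assembly; the exact resource name depends on the MIG profile available on your hardware:

[source,yaml]
----
# Whole-GPU request through the NVIDIA device plugin (not available with the DAS Operator):
#   resources:
#     limits:
#       nvidia.com/gpu: "1"
# MIG slice request managed by the DAS Operator:
resources:
  limits:
    nvidia.com/mig-1g.5gb: "1"
----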

//Installing the Dynamic Accelerator Slicer Operator
include::modules/das-operator-installing.adoc[leveloffset=+1]

//Installing the Dynamic Accelerator Slicer Operator using the web console
include::modules/das-operator-installing-web-console.adoc[leveloffset=+2]
[role="_additional-resources"]
.Additional resources
* xref:../security/cert_manager_operator/cert-manager-operator-install.adoc#cert-manager-operator-install[{cert-manager-operator}]
* xref:../hardware_enablement/psap-node-feature-discovery-operator.adoc#psap-node-feature-discovery-operator[Node Feature Discovery (NFD) Operator]
* link:https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/index.html[NVIDIA GPU Operator]
* link:https://docs.redhat.com/en/documentation/openshift_container_platform/4.19/html/specialized_hardware_and_driver_enablement/psap-node-feature-discovery-operator#creating-nfd-cr-web-console_psap-node-feature-discovery-operator[NodeFeatureDiscovery CR]

//Installing the Dynamic Accelerator Slicer Operator using the CLI
include::modules/das-operator-installing-cli.adoc[leveloffset=+2]
[role="_additional-resources"]
.Additional resources
* xref:../security/cert_manager_operator/cert-manager-operator-install.adoc#cert-manager-operator-install[{cert-manager-operator}]
* xref:../hardware_enablement/psap-node-feature-discovery-operator.adoc#psap-node-feature-discovery-operator[Node Feature Discovery (NFD) Operator]
* link:https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/index.html[NVIDIA GPU Operator]
* link:https://docs.redhat.com/en/documentation/openshift_container_platform/4.19/html/specialized_hardware_and_driver_enablement/psap-node-feature-discovery-operator#creating-nfd-cr-cli_psap-node-feature-discovery-operator[NodeFeatureDiscovery CR]

//Uninstalling the Dynamic Accelerator Slicer Operator
include::modules/das-operator-uninstalling.adoc[leveloffset=+1]

//Uninstalling the Dynamic Accelerator Slicer Operator using the web console
include::modules/das-operator-uninstalling-web-console.adoc[leveloffset=+2]

//Uninstalling the Dynamic Accelerator Slicer Operator using the CLI
include::modules/das-operator-uninstalling-cli.adoc[leveloffset=+2]

//Deploying GPU workloads with the Dynamic Accelerator Slicer Operator
include::modules/das-operator-deploying-workloads.adoc[leveloffset=+1]

//Troubleshooting DAS Operator
include::modules/das-operator-troubleshooting.adoc[leveloffset=+1]

[role="_additional-resources"]
.Additional resources
* link:https://github.com/kubernetes/kubernetes/issues/128043[Kubernetes issue #128043]
* xref:../hardware_enablement/psap-node-feature-discovery-operator.adoc#psap-node-feature-discovery-operator[Node Feature Discovery Operator]
* link:https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/troubleshooting.html[NVIDIA GPU Operator troubleshooting]
Lines changed: 149 additions & 0 deletions
@@ -0,0 +1,149 @@
// Module included in the following assemblies:
//
// * operators/user/das-dynamic-accelerator-slicer-operator.adoc
//
:_mod-docs-content-type: PROCEDURE
[id="das-operator-deploying-workloads_{context}"]
= Deploying GPU workloads with the Dynamic Accelerator Slicer Operator

You can deploy workloads that request GPU slices managed by the Dynamic Accelerator Slicer (DAS) Operator. The Operator dynamically partitions GPU accelerators and schedules workloads onto the available GPU slices.

.Prerequisites

* You have MIG-supported GPU hardware available in your cluster.
* The NVIDIA GPU Operator is installed and the `ClusterPolicy` shows a **Ready** state.
* You have installed the DAS Operator.

.Procedure

. Create a namespace by running the following command:
+
[source,terminal]
----
$ oc new-project cuda-workloads
----

. Save the following deployment, which requests GPU resources through the NVIDIA MIG resource `nvidia.com/mig-1g.5gb`, to a file named `cuda-vectoradd-deployment.yaml`:
+
[source,yaml]
----
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cuda-vectoradd
spec:
  replicas: 2
  selector:
    matchLabels:
      app: cuda-vectoradd
  template:
    metadata:
      labels:
        app: cuda-vectoradd
    spec:
      restartPolicy: Always
      containers:
      - name: cuda-vectoradd
        image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0-ubi8
        resources:
          limits:
            nvidia.com/mig-1g.5gb: "1"
        command:
        - sh
        - -c
        - |
          env && /cuda-samples/vectorAdd && sleep 3600
----

. Apply the deployment configuration by running the following command:
+
[source,terminal]
----
$ oc apply -f cuda-vectoradd-deployment.yaml
----

. Verify that the deployment is created and its pods are scheduled by running the following command:
+
[source,terminal]
----
$ oc get deployment cuda-vectoradd
----
+
.Example output
[source,terminal]
----
NAME             READY   UP-TO-DATE   AVAILABLE   AGE
cuda-vectoradd   2/2     2            2           2m
----

. Check the status of the pods by running the following command:
+
[source,terminal]
----
$ oc get pods -l app=cuda-vectoradd
----
+
.Example output
[source,terminal]
----
NAME                              READY   STATUS    RESTARTS   AGE
cuda-vectoradd-6b8c7d4f9b-abc12   1/1     Running   0          2m
cuda-vectoradd-6b8c7d4f9b-def34   1/1     Running   0          2m
----

.Verification

. Check that `AllocationClaim` resources were created for your deployment pods by running the following command:
+
[source,terminal]
----
$ oc get allocationclaims -n das-operator
----
+
.Example output
[source,terminal]
----
NAME                                                                                            AGE
13950288-57df-4ab5-82bc-6138f646633e-harpatil000034jma-qh5fm-worker-f-57md9-cuda-vectoradd-0   2m
ce997b60-a0b8-4ea4-9107-cf59b425d049-harpatil000034jma-qh5fm-worker-f-fl4wg-cuda-vectoradd-0   2m
----

. Verify that the GPU slices are properly allocated by checking the resource allocation of one of the pods. Run the following command:
+
[source,terminal]
----
$ oc describe pod -l app=cuda-vectoradd
----
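+
The `Limits` section of the output should show the allocated MIG slice resource rather than a whole GPU. The following excerpt is abridged and illustrative for this deployment; it assumes the standard `oc describe pod` output format:
+
[source,terminal]
----
    Limits:
      nvidia.com/mig-1g.5gb:  1
----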

. Check the logs to verify that the CUDA sample application runs successfully by running the following command:
+
[source,terminal]
----
$ oc logs -l app=cuda-vectoradd
----
+
.Example output
[source,terminal]
----
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
----

. Check the environment variables to verify that the GPU devices are properly exposed to the container by running the following command:
+
[source,terminal]
----
$ oc exec deployment/cuda-vectoradd -- env | grep -E "(NVIDIA_VISIBLE_DEVICES|CUDA_VISIBLE_DEVICES)"
----
+
.Example output
[source,terminal]
----
NVIDIA_VISIBLE_DEVICES=MIG-d8ac9850-d92d-5474-b238-0afeabac1652
CUDA_VISIBLE_DEVICES=MIG-d8ac9850-d92d-5474-b238-0afeabac1652
----
+
These environment variables indicate that the GPU MIG slice has been properly allocated and is visible to the CUDA runtime within the container.
