Commit ffd6958

Merge pull request #99730 from mburke5678/nodes-dra
OSDOCS 12580 Enable Dynamic Resource Allocations for openshift
2 parents 6228b86 + e8d23ba commit ffd6958

6 files changed

+263
-0
lines changed

_attributes/common-attributes.adoc

Lines changed: 3 additions & 0 deletions
@@ -414,3 +414,6 @@ endif::openshift-origin[]
414414
// Formerly out-of-cluster layering
415415
:image-mode-os-out-caps: Out-of-cluster image mode
416416
:image-mode-os-out-lower: out-of-cluster image mode
417+
// Use after feature GAs in 4.21?? In 4.20, the Operator isn't used :attribute-based-full: Attribute-Based GPU Allocation in OpenShift with the NVIDIA GPU Operator
418+
:attribute-based-full: Attribute-Based GPU Allocation
419+
:attribute-based-short: Attribute-Based GPU Allocation

_topic_maps/_topic_map.yml

Lines changed: 2 additions & 0 deletions
@@ -2659,6 +2659,8 @@ Topics:
26592659
Distros: openshift-enterprise,openshift-origin
26602660
- Name: Placing pods on specific nodes using node selectors
26612661
File: nodes-pods-node-selectors
2662+
- Name: Allocating GPUs to pods
2663+
File: nodes-pods-allocate-dra
26622664
Distros: openshift-enterprise,openshift-origin
26632665
- Name: Run Once Duration Override Operator
26642666
Dir: run_once_duration_override
Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * nodes/nodes-pods-allocate-dra.adoc
4+
5+
:_mod-docs-content-type: CONCEPT
6+
[id="nodes-pods-allocate-dra-about_{context}"]
7+
= About allocating GPUs to workloads
8+
9+
// Taken from https://issues.redhat.com/browse/OCPSTRAT-1756
10+
{attribute-based-full} enables pods to request graphics processing units (GPUs) based on specific device attributes, so that each pod receives a GPU that matches the specifications it requires.
11+
12+
// Hiding until GA. The driver is not integrated in the TP version.
13+
// With the NVIDIA Kubernetes DRA driver integrated into OpenShift,by the NVIDIA GPU Operator with a DRA driver
14+
15+
Attribute-based resource allocation requires that you install a Dynamic Resource Allocation (DRA) driver. A DRA driver is a third-party application that runs on each node in your cluster to interface with the hardware of that node.
16+
17+
The DRA driver advertises several GPU device attributes that {product-title} can use for precise GPU selection, including the following attributes:
18+
19+
Product Name::
20+
Pods can request an exact GPU model based on performance requirements or compatibility with applications. This ensures that workloads leverage the best-suited hardware for their tasks.
21+
22+
GPU Memory Capacity::
23+
Pods can request GPUs with a minimum or maximum memory capacity, such as 8 GB, 16 GB, or 40 GB. This is helpful with memory-intensive workloads such as large AI model training or data processing. This attribute enables applications to allocate GPUs that meet memory needs without overcommitting or underutilizing resources.
24+
25+
Compute Capability::
26+
Pods can request GPUs based on the compute capabilities of the GPU, such as the CUDA versions supported. Pods can target GPUs that are compatible with the application’s framework and leverage optimized processing capabilities.
27+
28+
Power and Thermal Profiles::
29+
Pods can request GPUs based on power usage or thermal characteristics, enabling power-sensitive or temperature-sensitive applications to operate efficiently. This is particularly useful in high-density environments where energy or cooling constraints are factors.
30+
31+
Device ID and Vendor ID::
32+
Pods can request GPUs based on the GPU's hardware specifics, which allows applications that require specific vendors or device types to make targeted requests.
33+
34+
Driver Version::
35+
Pods can request GPUs that run a specific driver version, ensuring compatibility with application dependencies and maximizing GPU feature access.
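
As a rough illustration of how attributes like these might be used in a request, the following sketch selects a device by product name and by minimum memory. The driver name `driver.example.com`, the attribute name `productName`, the capacity name `memory`, and the object names are assumptions made for this example. The attribute and capacity names that are actually available depend on the DRA driver that you install.

.Example request that selects on device attributes (illustrative only)
[source,yaml]
----
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  namespace: gpu-test1
  name: large-memory-gpu # hypothetical name
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: example-device-class # assumes this device class exists
        selectors:
        - cel:
            # Both conditions must be true for a device to satisfy the request.
            # The attribute and capacity names are hypothetical.
            expression: |-
              device.attributes['driver.example.com'].productName == 'Example GPU' &&
              device.capacity['driver.example.com'].memory.compareTo(quantity('16Gi')) >= 0
----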
Lines changed: 97 additions & 0 deletions
@@ -0,0 +1,97 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * nodes/nodes-pods-allocate-dra.adoc
4+
5+
:_mod-docs-content-type: REFERENCE
6+
[id="nodes-pods-allocate-dra-configure-about_{context}"]
7+
= About GPU allocation objects
8+
9+
// Taken from https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#terminology
10+
{attribute-based-full} uses the following objects to provide the core graphics processing unit (GPU) allocation functionality. All of these API kinds are included in the `resource.k8s.io/v1beta2` API group.
11+
12+
Device class::
13+
A device class defines a category of devices that pods can claim and specifies how to select specific device attributes in claims. Some device drivers provide their own device class. Alternatively, an administrator can create device classes. A device class contains a device selector, which is a link:https://cel.dev/[Common Expression Language (CEL)] expression that must evaluate to true for a device to satisfy the request.
14+
+
15+
The following example `DeviceClass` object selects any device that is managed by the `driver.example.com` device driver:
16+
+
17+
.Example device class object
18+
[source,yaml]
19+
----
apiVersion: resource.k8s.io/v1beta1
kind: DeviceClass
metadata:
  name: example-device-class
spec:
  selectors:
  - cel:
      expression: |-
        device.driver == "driver.example.com"
----
30+
31+
Resource slice::
32+
The Dynamic Resource Allocation (DRA) driver on each node creates and manages _resource slices_ in the cluster. A resource slice represents one or more GPU resources that are attached to nodes. When a resource claim is created and used in a pod, {product-title} uses the resource slices to find nodes that have access to the requested resources. After finding an eligible resource slice for the resource claim, the {product-title} scheduler updates the resource claim with the allocation details, allocates resources to the resource claim, and schedules the pod onto a node that can access the resources.
33+
34+
Resource claim template::
35+
Cluster administrators and operators can create a _resource claim template_ to request a GPU from a specific device class. Resource claim templates provide pods with access to separate, similar resources. {product-title} uses a resource claim template to generate a resource claim for the pod. Each resource claim that {product-title} generates from the template is bound to a specific pod. When the pod terminates, {product-title} deletes the corresponding resource claim.
36+
+
37+
The following example resource claim template requests devices in the `example-device-class` device class.
38+
+
39+
.Example resource claim template object
40+
[source,yaml]
41+
----
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  namespace: gpu-test1
  name: gpu-claim-template
spec:
# ...
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: example-device-class
----
55+
56+
Resource claim::
57+
Cluster administrators and operators can create a _resource claim_ to request a GPU from a specific device class. Unlike a resource claim template, a resource claim allows you to share GPUs among multiple pods. Also, resource claims are not deleted when a requesting pod is terminated.
58+
+
59+
The following example resource claim uses CEL expressions to request specific devices in the `example-device-class` device class that are of a specific size.
60+
+
61+
.Example resource claim object
62+
[source,yaml]
63+
----
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  namespace: gpu-claim
  name: gpu-devices
spec:
  devices:
    requests:
    - name: 1g-5gb
      deviceClassName: example-device-class
      selectors:
      - cel:
          expression: "device.attributes['driver.example.com'].profile == '1g.5gb'"
    - name: 1g-5gb-2
      deviceClassName: example-device-class
      selectors:
      - cel:
          expression: "device.attributes['driver.example.com'].profile == '1g.5gb'"
    - name: 2g-10gb
      deviceClassName: example-device-class
      selectors:
      - cel:
          expression: "device.attributes['driver.example.com'].profile == '2g.10gb'"
    - name: 3g-20gb
      deviceClassName: example-device-class
      selectors:
      - cel:
          expression: "device.attributes['driver.example.com'].profile == '3g.20gb'"
----
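
Resource slices are created and managed by the DRA driver, so you typically read them rather than write them. As a sketch of what one might contain, the following example assumes the `resource.k8s.io/v1beta1` schema that the other examples in this module use, a hypothetical node named `worker-0`, and made-up device, attribute, and capacity names. The field layout can differ in other API versions, and real drivers publish their own attribute names.

.Example resource slice object (illustrative only)
[source,yaml]
----
apiVersion: resource.k8s.io/v1beta1
kind: ResourceSlice
metadata:
  name: worker-0-driver-example-com # hypothetical; drivers choose their own names
spec:
  driver: driver.example.com
  nodeName: worker-0 # hypothetical node that the devices are attached to
  pool:
    name: worker-0
    generation: 1
    resourceSliceCount: 1
  devices:
  - name: gpu-0 # hypothetical device name
    basic:
      attributes:
        # An unqualified attribute name is scoped to the driver, so a claim can
        # refer to it as device.attributes['driver.example.com'].profile.
        profile:
          string: "1g.5gb"
      capacity:
        memory:
          value: 5Gi
----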
94+
95+
For more information on configuring resource claims and resource claim templates, see link:https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/["Dynamic Resource Allocation"] (Kubernetes documentation).
96+
97+
For information on adding resource claims to pods, see "Adding resource claims to pods".
Lines changed: 89 additions & 0 deletions
@@ -0,0 +1,89 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * nodes/nodes-pods-allocate-dra.adoc
4+
5+
:_mod-docs-content-type: PROCEDURE
6+
[id="nodes-pods-allocate-dra-configure_{context}"]
7+
= Adding resource claims to pods
8+
9+
{attribute-based-full} uses resource claims and resource claim templates to allow you to request specific graphics processing units (GPUs) for the containers in your pods. Resource claims can be used with multiple containers, but resource claim templates can be used with only one container. For more information, see "About configuring device allocation by using device attributes" in the _Additional resources_ section.
10+
11+
The example in the following procedure creates a pod that uses a resource claim template to assign a specific GPU to `container0` and a resource claim to share a GPU between `container1` and `container2`.
12+
13+
.Prerequisites
14+
15+
* A Dynamic Resource Allocation (DRA) driver is installed. For more information on DRA, see link:https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/["Dynamic Resource Allocation"] (Kubernetes documentation).
16+
//Remove for TP * The Nvidia GPU Operator is installed. For more information see "Adding Operators to a cluster" in the _Additional Resources_ section.
17+
* A resource slice has been created.
18+
* A resource claim and/or resource claim template has been created.
19+
* You enabled the required Technology Preview features for your cluster by editing the `FeatureGate` CR named `cluster`:
20+
+
21+
.Example `FeatureGate` CR
22+
[source,yaml]
23+
----
apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  name: cluster
spec:
  featureSet: TechPreviewNoUpgrade <1>
----
31+
<1> Enables the required features.
32+
+
33+
[WARNING]
34+
====
35+
Enabling the `TechPreviewNoUpgrade` feature set on your cluster cannot be undone and prevents minor version updates. This feature set allows you to enable these Technology Preview features on test clusters, where you can fully test them. Do not enable this feature set on production clusters.
36+
====
37+
38+
.Procedure
39+
40+
. Create a pod by creating a YAML file similar to the following:
41+
+
42+
.Example pod that is requesting resources
43+
[source,yaml]
44+
----
apiVersion: v1
kind: Pod
metadata:
  namespace: gpu-allocate
  name: pod1
  labels:
    app: pod
spec:
  restartPolicy: Never
  containers:
  - name: container0
    image: ubuntu:24.04
    command: ["sleep", "9999"]
    resources:
      claims: <1>
      - name: gpu-claim-template
  - name: container1
    image: ubuntu:24.04
    command: ["sleep", "9999"]
    resources:
      claims:
      - name: gpu-claim
  - name: container2
    image: ubuntu:24.04
    command: ["sleep", "9999"]
    resources:
      claims:
      - name: gpu-claim
  resourceClaims: <2>
  - name: gpu-claim-template
    resourceClaimTemplateName: example-resource-claim-template
  - name: gpu-claim
    resourceClaimName: example-resource-claim
----
79+
<1> Specifies one or more resource claims to use with this container.
80+
<2> Specifies the resource claims that the containers require to start. Each entry includes an arbitrary name, which the containers reference under `resources.claims`, and the name of either a resource claim or a resource claim template.
81+
82+
. Create the pod by running the following command:
83+
+
84+
[source,terminal]
85+
----
$ oc create -f <file_name>.yaml
----
88+
89+
For more information on configuring pod resource requests, see link:https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/["Dynamic Resource Allocation"] (Kubernetes documentation).
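
The example pod refers to a resource claim template named `example-resource-claim-template` and a resource claim named `example-resource-claim`, which must already exist in the namespace of the pod. A minimal sketch of objects that would satisfy those references follows. It assumes that a device class named `example-device-class` exists and that the `resource.k8s.io/v1beta1` API version is served on your cluster. The request name `gpu` is arbitrary.

.Example objects that match the names used in the example pod (illustrative only)
[source,yaml]
----
# Used by container0. A separate claim is generated from this template for the pod.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  namespace: gpu-allocate
  name: example-resource-claim-template
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: example-device-class # assumed to exist
---
# Shared by container1 and container2.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  namespace: gpu-allocate
  name: example-resource-claim
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: example-device-class # assumed to exist
----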
Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
1+
:_mod-docs-content-type: ASSEMBLY
2+
:context: nodes-pods-allocate-dra
3+
[id="nodes-pods-allocate-dra"]
4+
= Allocating GPUs to pods
5+
include::_attributes/common-attributes.adoc[]
6+
7+
toc::[]
8+
9+
// Taken from https://issues.redhat.com/browse/OCPSTRAT-1756
10+
// Naming taken from https://issues.redhat.com/browse/OCPSTRAT-2384. Is this correct?
11+
{attribute-based-full} enables fine-tuned control over graphics processing unit (GPU) resource allocation in {product-title}, allowing pods to request GPUs based on specific device attributes, including product name, GPU memory capacity, compute capability, vendor name, and driver version. These attributes are exposed by a third-party Dynamic Resource Allocation (DRA) driver.
12+
13+
// Hiding until GA. The driver is not integrated in the TP version
14+
// This attribute-based resource allocation is achieved through the integration of the NVIDIA Kubernetes DRA driver into OpenShift.
15+
16+
:FeatureName: {attribute-based-full}
17+
include::snippets/technology-preview.adoc[]
18+
19+
// The following include statements pull in the module files that comprise
20+
// the assembly. Include any combination of concept, procedure, or reference
21+
// modules required to cover the user story. You can also include other
22+
// assemblies.
23+
24+
include::modules/nodes-pods-allocate-dra-about.adoc[leveloffset=+1]
25+
26+
include::modules/nodes-pods-allocate-dra-configure-about.adoc[leveloffset=+1]
27+
28+
.Next steps
29+
* xref:../../nodes/pods/nodes-pods-allocate-dra.adoc#nodes-pods-allocate-dra-configure_nodes-pods-allocate-dra[Adding resource claims to pods]
30+
31+
include::modules/nodes-pods-allocate-dra-configure.adoc[leveloffset=+1]
32+
33+
[role="_additional-resources"]
34+
.Additional resources
35+
// Hiding until GA link:https://catalog.ngc.nvidia.com/orgs/nvidia/helm-charts/nvidia-dra-driver-gpu?version=25.3.2[NVIDIA DRA Driver for GPUs]
36+
// Hiding until GA * xref:../../operators/admin/olm-adding-operators-to-cluster.adoc#olm-adding-operators-to-a-cluster[Adding Operators to a cluster]
37+
* xref:../../nodes/pods/nodes-pods-allocate-dra.adoc#nodes-pods-allocate-dra-configure-about_nodes-pods-allocate-dra[About configuring device allocation by using device attributes]
