|
23 | 23 | <!-- /toc --> |
24 | 24 |
|
25 | 25 | ## Summary |
26 | | -This proposal introduces a plugin to allow users to specify the priority of different resources and max resource |
27 | | -consumption for workload on differnet resources. |
| 26 | +This proposal introduces a plugin that enables users to set priorities for different resources and to define maximum resource consumption limits for workloads on each kind of resource.
28 | 27 |
|
29 | 28 | ## Motivation |
30 | | -The machines in a Kubernetes cluster are typically heterogeneous, with varying CPU, memory, GPU, and pricing. To |
| 29 | +A Kubernetes cluster typically consists of heterogeneous machines whose SKUs vary in CPU, memory, GPU, and price. To
31 | 30 | efficiently utilize the different resources available in the cluster, users can set priorities for machines of different |
32 | 31 | types and configure resource allocations for different workloads. Additionally, they may choose to delete pods running |
33 | 32 | on low priority nodes instead of high priority ones. |
34 | 33 |
|
35 | 34 | ### Use Cases |
36 | 35 |
|
37 | | -1. As a user of cloud services, there are some stable but expensive ECS instances and some unstable but cheaper Spot |
38 | | -instances in my cluster. I hope that my workload can be deployed first on stable ECS instances, and during business peak |
39 | | -periods, the Pods that are scaled out are deployed on Spot instances. At the end of the business peak, the Pods on Spot |
40 | | -instances are prioritized to be scaled in. |
| 36 | +1. As a user of cloud services, my cluster contains some stable but expensive VM instances and some cheaper but less
| 37 | +stable Spot instances. I want my workload to be deployed on the stable VM instances first, and during business peak
| 38 | +periods, the Pods that are scaled up should be placed on Spot instances. At the end of the business peak, the Pods on
| 39 | +Spot instances should be the first to be scaled down.
41 | 40 |
|
42 | 41 | ### Goals |
43 | 42 |
|
44 | | -1. Develop a filter plugin to restrict the resource consumption on each unit for different workloads. |
45 | | -2. Develop a score plugin to favor nodes matched by a high priority unit. |
| 43 | +1. Develop a filter plugin to restrict the resource consumption of different workloads on each kind of resource.
| 44 | +2. Develop a score plugin to favor nodes that match a higher-priority kind of resource.
46 | 45 | 3. Automatically set deletion costs on Pods through a controller to control the scale-in order of workloads.
47 | 46 |
|
48 | 47 | ### Non-Goals |
49 | 48 |
|
50 | | -1. Modify the workload controller to support deletion costs. If the workload don't support deletion costs, scaling in |
51 | | -sequence will be random. |
52 | | -2. When creating a ResourcePolicy, if the number of Pods has already violated the quantity constraint of the |
53 | | -ResourcePolicy, we will not attempt to delete the excess Pods. |
54 | | - |
| 49 | +1. The scheduler will not delete pods.
55 | 50 |
|
56 | 51 | ## Proposal |
57 | 52 |
|
58 | | -### CRD API |
| 53 | +### API |
59 | 54 | ```yaml |
60 | 55 | apiVersion: scheduling.sigs.x-k8s.io/v1alpha1 |
61 | 56 | kind: ResourcePolicy |
|
67 | 62 | - pod-template-hash |
68 | 63 | matchPolicy: |
69 | 64 | ignoreTerminatingPod: true |
70 | | - ignorePreviousPod: false |
71 | | - forceMaxNum: false |
72 | 65 | podSelector: |
73 | 66 | matchExpressions: |
74 | 67 | - key: key1 |
@@ -105,49 +98,70 @@ spec: |
105 | 98 | key1: value3 |
106 | 99 | ``` |
107 | 100 |
|
108 | | -`Priority` define the priority of each unit. Pods will be scheduled on units with a higher priority. |
| 101 | +```go |
| 102 | +type ResourcePolicy struct {
| 103 | +    metav1.TypeMeta
| 104 | +    metav1.ObjectMeta
| 105 | +
| 106 | +    Spec ResourcePolicySpec
| 107 | +}
| 108 | +type ResourcePolicySpec struct {
| 109 | +    MatchLabelKeys []string             // how matched pods are grouped, like matchLabelKeys in PodTopologySpread
| 110 | +    MatchPolicy    MatchPolicy
| 111 | +    Strategy       string               // "required" or "prefer"
| 112 | +    PodSelector    metav1.LabelSelector // selects pods in the same namespace
| 113 | +    Units          []Unit
| 114 | +}
| 115 | +type MatchPolicy struct {
| 116 | +    IgnoreTerminatingPod bool // if true, terminating pods are not counted
| 117 | +}
| 118 | +type Unit struct {
| 119 | +    Priority     *int32               // higher value means higher priority; defaults to 0
| 120 | +    MaxCount     *int32               // maximum number of matched pods on this unit; unlimited if nil
| 121 | +    NodeSelector metav1.LabelSelector // nodes that belong to this unit
| 122 | +}
| 123 | +``` |
| 124 | + |
| 125 | +A pod is matched by a ResourcePolicy in the same namespace when it satisfies the `.spec.podSelector`. If `.spec.matchPolicy.ignoreTerminatingPod` is `true`, pods with a non-zero `.metadata.deletionTimestamp` are ignored.
| 126 | +A ResourcePolicy never matches pods in other namespaces, and one pod cannot be matched by more than one ResourcePolicy.
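To make the matching rules concrete, the sketch below shows one possible implementation. It assumes the `ResourcePolicy` Go types above; the package and helper names (`resourcepolicy`, `matchesPolicy`) are illustrative, not part of this proposal.

```go
package resourcepolicy

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// matchesPolicy reports whether a pod is matched by the given ResourcePolicy,
// following the rules above: same namespace, podSelector, and matchPolicy.
func matchesPolicy(pod *corev1.Pod, rp *ResourcePolicy) (bool, error) {
	// A ResourcePolicy never matches pods in other namespaces.
	if pod.Namespace != rp.Namespace {
		return false, nil
	}
	// Optionally ignore terminating pods.
	if rp.Spec.MatchPolicy.IgnoreTerminatingPod && pod.DeletionTimestamp != nil {
		return false, nil
	}
	// The pod labels must satisfy .spec.podSelector.
	selector, err := metav1.LabelSelectorAsSelector(&rp.Spec.PodSelector)
	if err != nil {
		return false, err
	}
	return selector.Matches(labels.Set(pod.Labels)), nil
}
```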
| 127 | + |
| 128 | +Pods can only be scheduled on the units defined in `.spec.units`; this behavior can be changed via `.spec.strategy`. Each item in `.spec.units` describes one kind of resource in the cluster, namely the set of nodes that match its `NodeSelector`.
| 129 | + |
| 130 | +`.spec.units[].priority` defines the priority of each unit. Units with a higher priority get a higher score in the score plugin.
109 | 131 | If all units have the same priority, the resource policy will only limit the maximum number of pods on these units.
| 132 | +If `.spec.units[].priority` is not set, the default value is 0.
| 133 | +`.spec.units[].maxCount` defines the maximum number of pods that can be scheduled on each unit. If `.spec.units[].maxCount` is not set, pods can always be scheduled on the unit unless it runs out of resources.
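For illustration, the capacity and priority semantics of a unit could be expressed as small helpers like the sketch below (hypothetical function names, reusing the `Unit` type above):

```go
package resourcepolicy

// unitHasCapacity reports whether a unit can accept another matched pod,
// given how many matched pods are already counted on it.
func unitHasCapacity(unit Unit, assigned int32) bool {
	// Without maxCount, the unit is only limited by the nodes' resources.
	if unit.MaxCount == nil {
		return true
	}
	return assigned < *unit.MaxCount
}

// unitPriority returns the effective priority of a unit; an unset priority defaults to 0.
func unitPriority(unit Unit) int32 {
	if unit.Priority == nil {
		return 0
	}
	return *unit.Priority
}
```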
110 | 134 |
|
111 | | -`Strategy` indicate how we treat the nodes doesn't match any unit. |
| 135 | +`.spec.strategy` indicates how nodes that do not match any unit are treated.
112 | 136 | If the strategy is `required`, the pod can only be scheduled on nodes that match the units in the resource policy.
113 | 137 | If the strategy is `prefer`, the pod can be scheduled on all nodes, but nodes that do not match any unit are only
114 | 138 | considered after the nodes that do. So if the strategy is `required`, we will return `unschedulable`
115 | 139 | for the nodes that do not match any unit.
116 | 140 |
|
117 | | -`MatchLabelKeys` indicate how we group the pods matched by `podSelector` and `matchPolicy`, its behavior is like |
118 | | -`MatchLabelKeys` in `PodTopologySpread`. |
119 | | - |
120 | | -`matchPolicy` indicate if we should ignore some kind pods when calculate pods in certain unit. |
121 | | - |
122 | | -If `forceMaxNum` is set `true`, we will not try the next units when one unit is not full, this property have no effect |
123 | | -when `max` is not set in units. |
| 141 | +`.spec.matchLabelKeys` indicates how we group the pods matched by `podSelector` and `matchPolicy`; its behavior is similar to
| 142 | +`matchLabelKeys` in `PodTopologySpread`.
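As an illustration of the grouping, the counting key could be derived from the values of the listed labels, so that, for example, pods from different ReplicaSets of the same Deployment are counted separately when `pod-template-hash` is listed. The helper below is a hypothetical sketch, not the exact implementation:

```go
package resourcepolicy

import (
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// groupKey builds the key used to group the pods matched by one ResourcePolicy,
// by joining the pod's values for the configured matchLabelKeys.
func groupKey(pod *corev1.Pod, matchLabelKeys []string) string {
	values := make([]string, 0, len(matchLabelKeys))
	for _, key := range matchLabelKeys {
		// A missing label contributes an empty value.
		values = append(values, pod.Labels[key])
	}
	return strings.Join(values, "/")
}
```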
124 | 143 |
|
125 | 144 | ### Implementation Details |
126 | 145 |
|
127 | | -#### Scheduler Plugins |
128 | | - |
129 | | -For each unit, we will record which pods were scheduled on it to prevent too many pods scheduled on it. |
130 | | - |
131 | | -##### PreFilter |
| 146 | +#### PreFilter |
132 | 147 | PreFilter checks whether the current pod matches exactly one resource policy. If not, PreFilter rejects the pod.
133 | 148 | If it does, PreFilter counts the pods on each unit to determine which units are available for the pod
134 | 149 | and writes this information into the cycleState.
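A minimal sketch of the PreFilter bookkeeping, assuming the upstream scheduler framework types (`framework.CycleState`, `framework.StateData`); the state key and the `preFilterState` struct are illustrative names, not part of the framework:

```go
package resourcepolicy

import (
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// preFilterStateKey is the key under which the plugin stores its per-cycle state.
const preFilterStateKey framework.StateKey = "ResourcePolicy/PreFilter"

// preFilterState records the matched ResourcePolicy, the number of matched pods
// already counted on each unit, and which units are still available for the pod.
type preFilterState struct {
	policy         *ResourcePolicy
	podCountByUnit []int32
	availableUnits map[int]bool // index into .spec.units
}

// Clone implements framework.StateData; the state is treated as read-only
// after PreFilter, so returning the same pointer is sufficient for this sketch.
func (s *preFilterState) Clone() framework.StateData { return s }

// writePreFilterState stores the computed state in the scheduling cycle so that
// Filter and Score can read it later.
func writePreFilterState(cycleState *framework.CycleState, s *preFilterState) {
	cycleState.Write(preFilterStateKey, s)
}
```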
135 | 150 |
|
136 | | -##### Filter |
| 151 | +#### Filter |
137 | 152 | Filter checks whether the node belongs to an available unit. If the node doesn't belong to any unit, we will return
138 | | -success if the strategy is `prefer`, otherwise we will return unschedulable. |
| 153 | +success if the `.spec.strategy` is `prefer`; otherwise we will return unschedulable.
139 | 154 |
|
140 | 155 | Besides, Filter will check whether the pods already scheduled on the unit violate the quantity constraint.
141 | | -If the number of pods has reach the `maxCount`, all the nodes in unit will be marked unschedulable. |
| 156 | +If the number of pods has reached `.spec.units[].maxCount`, all the nodes in the unit will be marked unschedulable.
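Putting the Filter rules together, a hedged sketch of the per-node decision (reusing the hypothetical `preFilterState` above; the strategy strings follow the values described earlier):

```go
package resourcepolicy

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// filterNode decides whether a node can host the pod according to the matched
// ResourcePolicy, mirroring the Filter rules described above.
func filterNode(s *preFilterState, node *corev1.Node) *framework.Status {
	for i, unit := range s.policy.Spec.Units {
		if !nodeMatchesUnit(node, &unit) {
			continue
		}
		// The node belongs to a unit: it is schedulable only if the unit has
		// not reached its maxCount.
		if s.availableUnits[i] {
			return framework.NewStatus(framework.Success)
		}
		return framework.NewStatus(framework.Unschedulable, "unit has reached its maxCount")
	}
	// The node matches no unit: the strategy decides.
	if s.policy.Spec.Strategy == "prefer" {
		return framework.NewStatus(framework.Success)
	}
	return framework.NewStatus(framework.Unschedulable, "node does not match any unit of the resource policy")
}

// nodeMatchesUnit reports whether the node's labels satisfy the unit's NodeSelector.
func nodeMatchesUnit(node *corev1.Node, unit *Unit) bool {
	selector, err := metav1.LabelSelectorAsSelector(&unit.NodeSelector)
	if err != nil {
		return false
	}
	return selector.Matches(labels.Set(node.Labels))
}
```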
142 | 157 |
|
143 | | -##### Score |
144 | | -If `priority` is set in resource policy, we will schedule pod based on `priority`. Default priority is 1, and minimum |
145 | | -priority is 1. |
| 158 | +#### Score |
| 159 | +If `.spec.units[].priority` is set in the resource policy, pods will be scheduled based on `.spec.units[].priority`. The default priority is 0, and the minimum
| 160 | +priority is 0.
146 | 161 |
|
147 | 162 | Score calculation details: |
148 | 163 |
|
149 | | -1. calculate priority score, `scorePriority = (priority-1) * 20`, to make sure we give nodes without priority a minimum |
150 | | -score. |
| 164 | +1. calculate the priority score, `scorePriority = priority * 20`, so that nodes whose unit has no priority set (default 0) get the minimum score (see the sketch after this list).
151 | 165 | 2. normalize score |
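A sketch of the two steps, assuming the raw score is later rescaled in the framework's NormalizeScore extension point; the helper names are illustrative:

```go
package resourcepolicy

import (
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// unitScore converts a unit's priority into a raw node score.
// Units without a priority default to 0 and therefore get the minimum score.
func unitScore(unit Unit) int64 {
	var priority int64
	if unit.Priority != nil {
		priority = int64(*unit.Priority)
	}
	return priority * 20
}

// normalizeScores rescales the raw scores into [0, framework.MaxNodeScore],
// which is what the NormalizeScore extension point is expected to do.
func normalizeScores(scores framework.NodeScoreList) {
	var max int64
	for _, s := range scores {
		if s.Score > max {
			max = s.Score
		}
	}
	if max == 0 {
		return
	}
	for i := range scores {
		scores[i].Score = scores[i].Score * framework.MaxNodeScore / max
	}
}
```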
152 | 166 |
|
153 | 167 | #### Resource Policy Controller |
|