From aea87272310aa5cb386528b2ab61b0a7edb4e018 Mon Sep 17 00:00:00 2001
From: KunWuLuan
Date: Thu, 25 May 2023 09:34:53 +0800
Subject: [PATCH 1/5] add kep

---
 kep/594-resourcepolicy/README.md | 144 +++++++++++++++++++++++++++++++
 kep/594-resourcepolicy/kep.yaml  |   5 ++
 2 files changed, 149 insertions(+)
 create mode 100644 kep/594-resourcepolicy/README.md
 create mode 100644 kep/594-resourcepolicy/kep.yaml

diff --git a/kep/594-resourcepolicy/README.md b/kep/594-resourcepolicy/README.md
new file mode 100644
index 0000000000..4a65eddf10
--- /dev/null
+++ b/kep/594-resourcepolicy/README.md
@@ -0,0 +1,144 @@
+# Resource Policy
+
+## Table of Contents
+
+- Summary
+- Motivation
+  - Goals
+  - Non-Goals
+- Proposal
+  - CRD API
+  - Implementation details
+- Use Cases
+- Known limitations
+- Test plans
+- Graduation criteria
+- Production Readiness Review Questionnaire
+  - Feature enablement and rollback
+- Implementation history
+
+## Summary
+This proposal introduces a plugin to allow users to specify the priority of different resources and max resource consumption for workload on differnet resources.
+
+## Motivation
+The machines in a Kubernetes cluster are typically heterogeneous, with varying CPU, memory, GPU, and pricing. To efficiently utilize the different resources available in the cluster, users can set priorities for machines of different types and configure resource allocations for different workloads. Additionally, they may choose to delete pods running on low priority nodes instead of high priority ones.
+
+### Use Cases
+
+1. As a user of cloud services, there are some stable but expensive ECS instances and some unstable but cheaper Spot instances in my cluster. I hope that my workload can be deployed first on stable ECS instances, and during business peak periods, the Pods that are scaled out are deployed on Spot instances. At the end of the business peak, the Pods on Spot instances are prioritized to be scaled in.
+
+### Goals
+
+1. Delvelop a filter plugin to restrict the resource consumption on each unit for different workloads.
+2. Develop a score plugin to favor nodes matched by a high priority unit.
+3. Automatically setting deletion costs on Pods to control the scaling in sequence of workloads through a controller.
+
+### Non-Goals
+
+1. Modify the workload controller to support deletion costs. If the workload don't support deletion costs, scaling in sequence will be random.
+2. When creating a ResourcePolicy, if the number of Pods has already violated the quantity constraint of the ResourcePolicy, we will not attempt to delete the excess Pods.
+
+
+## Proposal
+
+### CRD API
+```yaml
+apiVersion: scheduling.sigs.x-k8s.io/v1alpha1
+kind: ResourcePolicy
+metadata:
+  name: xxx
+  namespace: xxx
+spec:
+  podSelector:
+    matchExpressions:
+    - key: key1
+      operator: In
+      values:
+      - value1
+    matchLabels:
+      key1: value1
+  strategy: prefer
+  units:
+  - name: unit1
+    priority: 5
+    maxCount: 10
+    nodeSelector:
+      matchExpressions:
+      - key: key1
+        operator: In
+        values:
+        - value1
+  - name: unit2
+    priority: 5
+    maxCount: 10
+    nodeSelector:
+      matchExpressions:
+      - key: key1
+        operator: In
+        values:
+        - value2
+  - name: unit3
+    priority: 4
+    maxCount: 20
+    nodeSelector:
+      matchLabels:
+        key1: value3
+```
+
+`Priority` define the priority of each unit. Pods will be scheduled on units with a higher priority.
+If all units have the same priority, resourcepolicy will only limit the max pod on these units.
+
+`Strategy` indicate how we treat the nodes doesn't match any unit.
+If strategy is `required`, the pod can only be scheduled on nodes that match the units in resource policy.
+If strategy is `prefer`, the pod can be scheduled on all nodes, these nodes not match the units will be
+considered after all nodes match the units. So if the strategy is `required`, we will return `unschedulable`
+for those nodes not match the units.
+
+### Implementation Details
+
+
+#### Scheduler Plugins
+
+For each unit, we will record which pods were scheduled on it to prevent too many pods scheduled on it.
+
+##### PreFilter
+PreFilter check if the current pods match only one resource policy. If not, PreFilter will reject the pod.
+If yes, PreFilter will get the number of pods on each unit to determine which units are available for the pod
+and write this information into cycleState.
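
To make this bookkeeping concrete, the following is a minimal sketch of the per-unit state PreFilter could write into cycleState; the type, field, and function names are illustrative assumptions, not the plugin's final API.

```go
package main

import "fmt"

// preFilterState is the data PreFilter could record in cycleState:
// for every unit, how many matching pods are already scheduled on it,
// and whether the unit can still accept the incoming pod.
type preFilterState struct {
	policyName string
	podCount   map[string]int  // unit name -> pods already scheduled on the unit
	available  map[string]bool // unit name -> still below maxCount (or no maxCount)
}

// buildPreFilterState marks a unit as available when it has no maxCount
// or its current pod count is still below maxCount.
func buildPreFilterState(policyName string, counts map[string]int, maxCount map[string]*int32) *preFilterState {
	s := &preFilterState{policyName: policyName, podCount: counts, available: map[string]bool{}}
	for unit, n := range counts {
		limit := maxCount[unit]
		s.available[unit] = limit == nil || int32(n) < *limit
	}
	return s
}

func main() {
	ten := int32(10)
	counts := map[string]int{"unit1": 10, "unit2": 3}
	state := buildPreFilterState("demo-policy", counts, map[string]*int32{"unit1": &ten, "unit2": &ten})
	fmt.Println(state.available) // map[unit1:false unit2:true]
}
```

Filter and Score can then read this state instead of recounting pods for every node.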
+
+##### Filter
+Filter check if the node belongs to an available unit. If the node doesn't belong to any unit, we will return
+success if the strategy is `prefer`, otherwise we will return unschedulable.
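
A small sketch of this decision, including the per-unit `maxCount` quantity check that Filter also applies (the function signature and names are illustrative assumptions):

```go
package main

import "fmt"

type strategy string

const (
	strategyRequired strategy = "required"
	strategyPrefer   strategy = "prefer"
)

// filterNode reports whether the node may be considered for the pod.
// nodeUnit is empty when the node does not belong to any unit of the policy.
func filterNode(s strategy, nodeUnit string, unitAvailable map[string]bool) (bool, string) {
	if nodeUnit == "" {
		// Nodes outside every unit are schedulable only with the prefer strategy.
		if s == strategyPrefer {
			return true, ""
		}
		return false, "node does not match any unit of the resource policy"
	}
	if !unitAvailable[nodeUnit] {
		return false, fmt.Sprintf("unit %s has reached its maxCount", nodeUnit)
	}
	return true, ""
}

func main() {
	avail := map[string]bool{"unit1": false, "unit2": true}
	fmt.Println(filterNode(strategyRequired, "", avail)) // false: not in any unit
	fmt.Println(filterNode(strategyPrefer, "", avail))   // true
	fmt.Println(filterNode(strategyPrefer, "unit1", avail)) // false: maxCount reached
	fmt.Println(filterNode(strategyPrefer, "unit2", avail)) // true
}
```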
+
+##### Score
+If `priority` and `weight` is set in resource policy, we will schedule pod based on `priority` first. For units with the same `priority`, we will spread pods based on `weight`.
+
+Score calculation details:
+
+1. calculate priority score, `scorePriority = priority * 20`
+2. normalize score
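
The calculation can be sketched as follows; the helper names are illustrative assumptions, and the normalization step simply rescales raw scores into the 0–100 node score range used by the scheduler.

```go
package main

import "fmt"

// rawScore follows `scorePriority = priority * 20`.
func rawScore(priority int32) int64 {
	return int64(priority) * 20
}

// normalize rescales raw scores so that the highest-priority unit
// ends up with the maximum node score of 100.
func normalize(scores []int64) []int64 {
	var highest int64
	for _, s := range scores {
		if s > highest {
			highest = s
		}
	}
	if highest == 0 {
		return scores
	}
	out := make([]int64, len(scores))
	for i, s := range scores {
		out[i] = s * 100 / highest
	}
	return out
}

func main() {
	raw := []int64{rawScore(5), rawScore(4), rawScore(0)}
	fmt.Println(raw)            // [100 80 0]
	fmt.Println(normalize(raw)) // [100 80 0]
}
```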
+
+##### PostFilter
+
+
+#### Resource Policy Controller
+Resource policy controller set deletion cost on pods when the related resource policies were updated or added.
+
+## Known limitations
+
+- Currently, deletion costs only take effect on the Deployment workload.
+
+## Test plans
+
+1. Add detailed unit and integration tests for the plugin and controller.
+2. Add basic e2e tests to ensure all components are working together.
+
+## Graduation criteria
+
+## Production Readiness Review Questionnaire
+
+## Feature enablement and rollback
+
+## Implementation history
+

diff --git a/kep/594-resourcepolicy/kep.yaml b/kep/594-resourcepolicy/kep.yaml
new file mode 100644
index 0000000000..32b3a3f9af
--- /dev/null
+++ b/kep/594-resourcepolicy/kep.yaml
@@ -0,0 +1,5 @@
+title: Resourcepolicy
+kep-number: 594
+authors:
+  - "@KunWuLuan"
+  - "@fjding"

From da87fd468e92b7b34fe119c5a1080ae513efc209 Mon Sep 17 00:00:00 2001
From: KunWuLuan
Date: Tue, 12 Sep 2023 14:11:16 +0800
Subject: [PATCH 2/5] update kep details

---
 kep/594-resourcepolicy/README.md | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/kep/594-resourcepolicy/README.md b/kep/594-resourcepolicy/README.md
index 4a65eddf10..933d1463dc 100644
--- a/kep/594-resourcepolicy/README.md
+++ b/kep/594-resourcepolicy/README.md
@@ -29,7 +29,7 @@ The machines in a Kubernetes cluster are typically heterogeneous, with varying C
 
 ### Goals
 
-1. Delvelop a filter plugin to restrict the resource consumption on each unit for different workloads.
+1. Develop a filter plugin to restrict the resource consumption on each unit for different workloads.
 2. Develop a score plugin to favor nodes matched by a high priority unit.
 3. Automatically setting deletion costs on Pods to control the scaling in sequence of workloads through a controller.
 
@@ -110,12 +110,15 @@ and write this information into cycleState.
 Filter check if the node belongs to an available unit. If the node doesn't belong to any unit, we will return
 success if the strategy is `prefer`, otherwise we will return unschedulable.
 
+Besides, filter will check if the pods that was scheduled on the unit has already violated the quantity constraint.
+If the number of pods has reach the `maxCount`, all the nodes in unit will be marked unschedulable.
+
 ##### Score
-If `priority` and `weight` is set in resource policy, we will schedule pod based on `priority` first. For units with the same `priority`, we will spread pods based on `weight`.
+If `priority` is set in resource policy, we will schedule pod based on `priority`. Default priority is 1, and minimum priority is 1.
 
 Score calculation details:
 
-1. calculate priority score, `scorePriority = priority * 20`
+1. calculate priority score, `scorePriority = (priority-1) * 20`, to make sure we give nodes without priority a minimum score.
 2. normalize score
 
 ##### PostFilter

From fe23556363b6d492745895f22029fefb7eb5331b Mon Sep 17 00:00:00 2001
From: KunWuLuan
Date: Fri, 19 Jan 2024 16:44:02 +0800
Subject: [PATCH 3/5] add some more details in kep

---
 kep/594-resourcepolicy/README.md | 65 +++++++++++++++++++++++++-------
 1 file changed, 52 insertions(+), 13 deletions(-)

diff --git a/kep/594-resourcepolicy/README.md b/kep/594-resourcepolicy/README.md
index 933d1463dc..60b030c6db 100644
--- a/kep/594-resourcepolicy/README.md
+++ b/kep/594-resourcepolicy/README.md
@@ -18,14 +18,21 @@
 - Implementation history
 
 ## Summary
-This proposal introduces a plugin to allow users to specify the priority of different resources and max resource consumption for workload on differnet resources.
+This proposal introduces a plugin to allow users to specify the priority of different resources and max resource
+consumption for workload on differnet resources.
 
 ## Motivation
-The machines in a Kubernetes cluster are typically heterogeneous, with varying CPU, memory, GPU, and pricing. To efficiently utilize the different resources available in the cluster, users can set priorities for machines of different types and configure resource allocations for different workloads. Additionally, they may choose to delete pods running on low priority nodes instead of high priority ones.
+The machines in a Kubernetes cluster are typically heterogeneous, with varying CPU, memory, GPU, and pricing. To
+efficiently utilize the different resources available in the cluster, users can set priorities for machines of different
+types and configure resource allocations for different workloads. Additionally, they may choose to delete pods running
+on low priority nodes instead of high priority ones.
 
 ### Use Cases
 
-1. As a user of cloud services, there are some stable but expensive ECS instances and some unstable but cheaper Spot instances in my cluster. I hope that my workload can be deployed first on stable ECS instances, and during business peak periods, the Pods that are scaled out are deployed on Spot instances. At the end of the business peak, the Pods on Spot instances are prioritized to be scaled in.
+1. As a user of cloud services, there are some stable but expensive ECS instances and some unstable but cheaper Spot
+instances in my cluster. I hope that my workload can be deployed first on stable ECS instances, and during business peak
+periods, the Pods that are scaled out are deployed on Spot instances. At the end of the business peak, the Pods on Spot
+instances are prioritized to be scaled in.
 
 ### Goals
@@ -35,8 +42,10 @@ The machines in a Kubernetes cluster are typically heterogeneous, with varying C
 ### Non-Goals
 
-1. Modify the workload controller to support deletion costs. If the workload don't support deletion costs, scaling in sequence will be random.
-2. When creating a ResourcePolicy, if the number of Pods has already violated the quantity constraint of the ResourcePolicy, we will not attempt to delete the excess Pods.
+1. Modify the workload controller to support deletion costs. If the workload don't support deletion costs, scaling in
+sequence will be random.
+2. When creating a ResourcePolicy, if the number of Pods has already violated the quantity constraint of the
+ResourcePolicy, we will not attempt to delete the excess Pods.
 
 
 ## Proposal
@@ -49,6 +58,12 @@ metadata:
   name: xxx
   namespace: xxx
 spec:
+  matchLabelKeys:
+  - pod-template-hash
+  matchPolicy:
+    ignoreTerminatingPod: true
+    ignorePreviousPod: false
+  forceMaxNum: false
   podSelector:
     matchExpressions:
     - key: key1
@@ -94,8 +109,15 @@ If strategy is `prefer`, the pod can be scheduled on all nodes, these nodes not
 considered after all nodes match the units. So if the strategy is `required`, we will return `unschedulable`
 for those nodes not match the units.
 
-### Implementation Details
+`MatchLabelKeys` indicate how we group the pods matched by `podSelector` and `matchPolicy`, its behavior is like
+`MatchLabelKeys` in `PodTopologySpread`.
+
+`matchPolicy` indicate if we should ignore some kind pods when calculate pods in certain unit.
+
+If `forceMaxNum` is set `true`, we will not try the next units when one unit is not full, this property have no effect
+when `max` is not set in units.
+
+### Implementation Details
 
 
 #### Scheduler Plugins
@@ -114,16 +136,15 @@ Besides, filter will check if the pods that was scheduled on the unit has alread
 If the number of pods has reach the `maxCount`, all the nodes in unit will be marked unschedulable.
 
 ##### Score
-If `priority` is set in resource policy, we will schedule pod based on `priority`. Default priority is 1, and minimum priority is 1.
+If `priority` is set in resource policy, we will schedule pod based on `priority`. Default priority is 1, and minimum
+priority is 1.
 
 Score calculation details:
 
-1. calculate priority score, `scorePriority = (priority-1) * 20`, to make sure we give nodes without priority a minimum score.
+1. calculate priority score, `scorePriority = (priority-1) * 20`, to make sure we give nodes without priority a minimum
+score.
 2. normalize score
 
-##### PostFilter
-
-
 #### Resource Policy Controller
 Resource policy controller set deletion cost on pods when the related resource policies were updated or added.
 
@@ -138,10 +159,28 @@ Resource policy controller set deletion cost on pods when the related resource p
 
 ## Graduation criteria
 
-## Production Readiness Review Questionnaire
+This plugin will only be enabled when users enable it in the scheduler framework and create a ResourcePolicy for their pods.
+So it is safe to graduate to beta.
+
+* Beta
+- [ ] Add node E2E tests.
+- [ ] Provide beta-level documentation.
 
 ## Feature enablement and rollback
 
-## Implementation history
+Enable `resourcepolicy` in the `multiPoint` plugin list to enable this plugin, like this:
+
+```yaml
+apiVersion: kubescheduler.config.k8s.io/v1
+kind: KubeSchedulerConfiguration
+leaderElection:
+  leaderElect: false
+profiles:
+- schedulerName: default-scheduler
+  plugins:
+    multiPoint:
+      enabled:
+      - name: resourcepolicy
+```

From 0c4e378ba07c0c765156f3ec30e58dc7ed3abd9d Mon Sep 17 00:00:00 2001
From: KunWuLuan
Date: Tue, 26 Mar 2024 20:53:33 +0800
Subject: [PATCH 4/5] update toc

Signed-off-by: KunWuLuan
---
 kep/594-resourcepolicy/README.md | 33 ++++++++++++++++++--------------
 1 file changed, 19 insertions(+), 14 deletions(-)

diff --git a/kep/594-resourcepolicy/README.md b/kep/594-resourcepolicy/README.md
index 60b030c6db..a4e58e46f7 100644
--- a/kep/594-resourcepolicy/README.md
+++ b/kep/594-resourcepolicy/README.md
@@ -2,20 +2,25 @@
 
 ## Table of Contents
 
-- Summary
-- Motivation
-  - Goals
-  - Non-Goals
-- Proposal
-  - CRD API
-  - Implementation details
-- Use Cases
-- Known limitations
-- Test plans
-- Graduation criteria
-- Production Readiness Review Questionnaire
-  - Feature enablement and rollback
-- Implementation history
+
+- [Summary](#summary)
+- [Motivation](#motivation)
+  - [Use Cases](#use-cases)
+  - [Goals](#goals)
+  - [Non-Goals](#non-goals)
+- [Proposal](#proposal)
+  - [CRD API](#crd-api)
+  - [Implementation Details](#implementation-details)
+    - [Scheduler Plugins](#scheduler-plugins)
+      - [PreFilter](#prefilter)
+      - [Filter](#filter)
+      - [Score](#score)
+    - [Resource Policy Controller](#resource-policy-controller)
+- [Known limitations](#known-limitations)
+- [Test plans](#test-plans)
+- [Graduation criteria](#graduation-criteria)
+- [Feature enablement and rollback](#feature-enablement-and-rollback)
+
 
 ## Summary
 This proposal introduces a plugin to allow users to specify the priority of different resources and max resource

From a7ec87fd33e3a0a311eff1d423056d720c6f6f8e Mon Sep 17 00:00:00 2001
From: KunWuLuan
Date: Fri, 24 Jan 2025 09:30:20 +0800
Subject: [PATCH 5/5] Update README.md

---
 kep/594-resourcepolicy/README.md | 92 ++++++++++++++++++--------------
 1 file changed, 53 insertions(+), 39 deletions(-)

diff --git a/kep/594-resourcepolicy/README.md b/kep/594-resourcepolicy/README.md
index a4e58e46f7..cc3feb3985 100644
--- a/kep/594-resourcepolicy/README.md
+++ b/kep/594-resourcepolicy/README.md
@@ -23,39 +23,34 @@
 
 ## Summary
-This proposal introduces a plugin to allow users to specify the priority of different resources and max resource
-consumption for workload on differnet resources.
+This proposal introduces a plugin that enables users to set priorities for various resources and define maximum resource consumption limits for workloads across different resources.
 
 ## Motivation
-The machines in a Kubernetes cluster are typically heterogeneous, with varying CPU, memory, GPU, and pricing. To
+A Kubernetes cluster typically consists of heterogeneous machines, with varying SKUs of CPU, memory, GPU, and pricing. To
 efficiently utilize the different resources available in the cluster, users can set priorities for machines of different
 types and configure resource allocations for different workloads. Additionally, they may choose to delete pods running
 on low priority nodes instead of high priority ones.
 
 ### Use Cases
 
-1. As a user of cloud services, there are some stable but expensive ECS instances and some unstable but cheaper Spot
-instances in my cluster.
-I hope that my workload can be deployed first on stable ECS instances, and during business peak
-periods, the Pods that are scaled out are deployed on Spot instances. At the end of the business peak, the Pods on Spot
-instances are prioritized to be scaled in.
+1. As an administrator of a Kubernetes cluster, there are some static but expensive VM instances and some dynamic but cheaper Spot
+instances in my cluster. I hope to restrict the resource consumption on each kind of resource for different workloads to limit the cost.
+I hope that important workloads in my cluster can be deployed first on static VM instances so that they will not worry about being preempted. And during business peak periods, the Pods that are scaled up are deployed on cheap Spot instances. At the end of the business peak, the Pods on Spot
+instances are prioritized to be scaled down.
 
 ### Goals
 
-1. Develop a filter plugin to restrict the resource consumption on each unit for different workloads.
-2. Develop a score plugin to favor nodes matched by a high priority unit.
+1. Develop a filter plugin to restrict the resource consumption on each kind of resource for different workloads.
+2. Develop a score plugin to favor nodes matched by a high-priority kind of resource.
 3. Automatically set deletion costs on Pods through a controller to control the scale-in order of workloads.
 
 ### Non-Goals
 
-1. Modify the workload controller to support deletion costs. If the workload don't support deletion costs, scaling in
-sequence will be random.
-2. When creating a ResourcePolicy, if the number of Pods has already violated the quantity constraint of the
-ResourcePolicy, we will not attempt to delete the excess Pods.
+1. The scheduler will not delete pods.
 
 ## Proposal
 
-### CRD API
+### API
 ```yaml
 apiVersion: scheduling.sigs.x-k8s.io/v1alpha1
 kind: ResourcePolicy
@@ -67,8 +62,6 @@ spec:
   - pod-template-hash
   matchPolicy:
     ignoreTerminatingPod: true
-    ignorePreviousPod: false
-  forceMaxNum: false
   podSelector:
     matchExpressions:
     - key: key1
@@ -105,49 +98,70 @@ spec:
       key1: value3
 ```
 
+```go
+type ResourcePolicy struct {
+	metav1.TypeMeta
+	metav1.ObjectMeta
+
+	Spec ResourcePolicySpec
+}
+type ResourcePolicySpec struct {
+	MatchLabelKeys []string
+	MatchPolicy    MatchPolicy
+	Strategy       string
+	PodSelector    metav1.LabelSelector
+	Units          []Unit
+}
+type MatchPolicy struct {
+	IgnoreTerminatingPod bool
+}
+type Unit struct {
+	Priority     *int32
+	MaxCount     *int32
+	NodeSelector metav1.LabelSelector
+}
+```
+
+Pods are matched by a ResourcePolicy in the same namespace when they match its `.spec.podSelector`. If `.spec.matchPolicy.ignoreTerminatingPod` is `true`, pods with a non-zero `.metadata.deletionTimestamp` are ignored.
+A ResourcePolicy never matches pods in a different namespace, and one pod cannot be matched by more than one ResourcePolicy.
+
+Pods can only be scheduled on the units defined in `.spec.units`; this behavior can be changed by `.spec.strategy`. Each item in `.spec.units` selects a set of nodes matching its `NodeSelector`, which describes one kind of resource in the cluster.
+
-`Priority` define the priority of each unit. Pods will be scheduled on units with a higher priority.
+`.spec.units[].priority` defines the priority of each unit. Units with a higher priority will get a higher score in the score plugin.
 If all units have the same priority, the resource policy only limits the maximum number of pods on these units.
+If `.spec.units[].priority` is not set, the default value is 0.
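
Because `Priority` and `MaxCount` are pointers in the Go types above, nil values have to be resolved to these defaults. A small illustrative sketch (the helper names are assumptions):

```go
package main

import (
	"fmt"
	"sort"
)

type unit struct {
	Name     string
	Priority *int32 // nil means the default priority 0
	MaxCount *int32 // nil means no limit on the unit
}

// effectivePriority resolves a nil Priority to the default value 0.
func effectivePriority(u unit) int32 {
	if u.Priority == nil {
		return 0
	}
	return *u.Priority
}

// sortByPriority orders units from the highest to the lowest priority,
// which is the order in which the plugin prefers to place pods.
func sortByPriority(units []unit) {
	sort.SliceStable(units, func(i, j int) bool {
		return effectivePriority(units[i]) > effectivePriority(units[j])
	})
}

func main() {
	five := int32(5)
	units := []unit{{Name: "spot"}, {Name: "stable", Priority: &five}}
	sortByPriority(units)
	fmt.Println(units[0].Name) // stable
}
```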
+
+`.spec.units[].maxCount` defines the maximum number of pods that can be scheduled on each unit. If `.spec.units[].maxCount` is not set, pods can always be scheduled on the unit unless there is not enough resource.
 
-`Strategy` indicate how we treat the nodes doesn't match any unit.
+`.spec.strategy` indicates how we treat the nodes that don't match any unit.
 If strategy is `required`, the pod can only be scheduled on nodes that match the units in the resource policy.
 If strategy is `prefer`, the pod can be scheduled on all nodes, but nodes that don't match the units will be
 considered only after all nodes that match the units. So if the strategy is `required`, we will return `unschedulable`
 for those nodes that don't match the units.
 
-`MatchLabelKeys` indicate how we group the pods matched by `podSelector` and `matchPolicy`, its behavior is like
-`MatchLabelKeys` in `PodTopologySpread`.
-
-`matchPolicy` indicate if we should ignore some kind pods when calculate pods in certain unit.
-
-If `forceMaxNum` is set `true`, we will not try the next units when one unit is not full, this property have no effect
-when `max` is not set in units.
+`.spec.matchLabelKeys` indicates how we group the pods matched by `podSelector` and `matchPolicy`; its behavior is like
+`matchLabelKeys` in `PodTopologySpread`.
 
 ### Implementation Details
 
-#### Scheduler Plugins
-
-For each unit, we will record which pods were scheduled on it to prevent too many pods scheduled on it.
-
-##### PreFilter
+#### PreFilter
 PreFilter checks whether the pod matches exactly one resource policy. If not, PreFilter rejects the pod.
 If it does, PreFilter counts the pods already scheduled on each unit to determine which units are available for the pod
 and writes this information into cycleState.
 
-##### Filter
+#### Filter
 Filter checks whether the node belongs to an available unit. If the node doesn't belong to any unit, we will return
-success if the strategy is `prefer`, otherwise we will return unschedulable.
+success if the `.spec.strategy` is `prefer`, otherwise we will return unschedulable.
 
 Besides, Filter checks whether the pods already scheduled on the unit violate the quantity constraint.
-If the number of pods has reach the `maxCount`, all the nodes in unit will be marked unschedulable.
+If the number of pods has reached `.spec.units[].maxCount`, all the nodes in the unit will be marked unschedulable.
 
-##### Score
-If `priority` is set in resource policy, we will schedule pod based on `priority`. Default priority is 1, and minimum
-priority is 1.
+#### Score
+If `.spec.units[].priority` is set in the resource policy, we will schedule the pod based on it. The default priority is 0, and the minimum priority is 0.
 
 Score calculation details:
 
-1. calculate priority score, `scorePriority = (priority-1) * 20`, to make sure we give nodes without priority a minimum
-score.
+1. calculate the priority score, `scorePriority = priority * 20`, to make sure we give nodes without a priority a minimum score.
 2. normalize score
 
 #### Resource Policy Controller
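
As described earlier, the resource policy controller sets deletion costs on matched pods so that pods on lower-priority units are scaled in first. A minimal sketch of that idea, assuming the standard `controller.kubernetes.io/pod-deletion-cost` annotation is what the controller writes (the helper below is illustrative, not the controller's actual code):

```go
package main

import (
	"fmt"
	"strconv"
)

// PodDeletionCostAnnotation is the standard annotation read by the
// ReplicaSet controller when it chooses which pods to scale in first;
// pods with a lower cost are deleted before pods with a higher cost.
const PodDeletionCostAnnotation = "controller.kubernetes.io/pod-deletion-cost"

// deletionCostFor derives a deletion cost from the priority of the unit a
// pod landed on: pods on low-priority (e.g. Spot) units get a lower cost
// and are therefore scaled in first.
func deletionCostFor(unitPriority int32) string {
	return strconv.Itoa(int(unitPriority) * 100)
}

func main() {
	// The controller would patch each matched pod with this annotation.
	annotations := map[string]string{
		PodDeletionCostAnnotation: deletionCostFor(1), // pod on a low-priority Spot unit
	}
	fmt.Println(annotations)
}
```

Workloads whose controllers ignore this annotation keep a random scale-in order, which is why the Known limitations section calls out that deletion costs only take effect for Deployments.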