@kamarabbas99
Contributor

@kamarabbas99 kamarabbas99 commented Nov 13, 2025

What type of PR is this?

/kind feature
/kind api-change

What this PR does / why we need it:

The CPU startup boost changes were developed on an experimental branch; this PR moves them to master.

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Users can configure a startupBoost policy in the VPA spec. 

Which issue(s) this PR fixes:

#7862

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: (https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler/enhancements/7862-cpu-startup-boost#aep-7862-cpu-startup-boost)

@k8s-ci-robot k8s-ci-robot added the release-note, kind/feature, kind/api-change, cncf-cla: yes, do-not-merge/needs-area, and area/vertical-pod-autoscaler labels and removed the do-not-merge/needs-area label Nov 13, 2025
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Nov 13, 2025
@kamarabbas99
Contributor Author

/cc adrianmoisey omerap12

@k8s-triage-robot

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

Member

@omerap12 omerap12 left a comment

Nice! we already approved this in previous PRs 🥳
/lgtm
/label tide/merge-method-squash

@k8s-ci-robot k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Nov 13, 2025
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 13, 2025
@omerap12
Member

oh good catch
/lgtm cancel

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 14, 2025
Update VPA version for startupboost feature
@adrianmoisey
Member

I'm good with this, thanks for doing it!
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 16, 2025
@omerap12
Member

/approve

@soltysh
Contributor

soltysh commented Nov 19, 2025

/label api-review

@k8s-ci-robot k8s-ci-robot added the api-review Categorizes an issue or PR as actively needing an API review. label Nov 19, 2025
@soltysh soltysh moved this to Backlog in API Reviews Nov 19, 2025
Contributor

@soltysh soltysh left a comment

I left a few API-related questions, but I didn't go deep into the logic itself, other than the validation bits.

maxAllowedCpu: resource.QuantityValue{},
featureGateEnabled: true,
expectError: fmt.Errorf("boost factor must be >= 1"),
},
Contributor

This is the only error case you're testing. I'd add additional ones for the combinations of type and factor|quantity specified, and probably one for an invalid type as well. That should be caught by API server validation, but it doesn't hurt to have a test case covering it.
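The extra cases suggested above could be laid out as a table-driven check. The following is only a minimal sketch: `cpuBoost`, `boostType`, `validateCPUBoost`, and `ptr` are hypothetical stand-ins rather than the real VPA API types or validation code, and every error string except "boost factor must be >= 1" (taken from the test snippet above) is an assumption. It deliberately does not reject a spec that sets both factor and quantity, since whether that should be an error is still an open question in this thread.

```go
package main

import (
	"errors"
	"fmt"
)

// boostType and cpuBoost are hypothetical, simplified stand-ins for the
// startup boost settings discussed in this PR, not the real API types.
type boostType string

const (
	typeFactor   boostType = "Factor"
	typeQuantity boostType = "Quantity"
)

type cpuBoost struct {
	Type          boostType
	Factor        *int32 // multiplier, assumed to be an integer here
	QuantityMilli *int64 // CPU quantity in millicores, for simplicity
}

// ptr returns a pointer to v; a small helper for building test cases.
func ptr[T any](v T) *T { return &v }

// validateCPUBoost checks that the field matching Type is present and sane.
// Only the "boost factor must be >= 1" message comes from the PR; the
// other messages are assumptions made for this sketch.
func validateCPUBoost(b cpuBoost) error {
	switch b.Type {
	case typeFactor:
		if b.Factor == nil {
			return errors.New("factor must be set when type is Factor")
		}
		if *b.Factor < 1 {
			return errors.New("boost factor must be >= 1")
		}
	case typeQuantity:
		if b.QuantityMilli == nil {
			return errors.New("quantity must be set when type is Quantity")
		}
	default:
		return fmt.Errorf("unknown boost type %q", b.Type)
	}
	return nil
}

func main() {
	cases := []struct {
		name  string
		boost cpuBoost
	}{
		{"factor below 1", cpuBoost{Type: typeFactor, Factor: ptr(int32(0))}},
		{"factor type without factor", cpuBoost{Type: typeFactor}},
		{"quantity type without quantity", cpuBoost{Type: typeQuantity}},
		{"invalid type", cpuBoost{Type: "Bogus"}},
		{"valid quantity", cpuBoost{Type: typeQuantity, QuantityMilli: ptr(int64(200))}},
	}
	for _, c := range cases {
		fmt.Printf("%-32s -> %v\n", c.name, validateCPUBoost(c.boost))
	}
}
```

Keeping the invalid-type case in the table costs little even if API server validation already rejects it, which is the point made in the comment above.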

Contributor

The other thing: the WithCPUStartupBoost method only modifies the VPA.spec and doesn't change the container-level StartupBoost. It would be good to also cover that case in a test, to ensure the latter takes priority.

Contributor

What is the relationship between the two? I missed that in AEP-7862, and the code seems to work on one or the other, depending on the location.

Contributor Author

@kamarabbas99 kamarabbas99 Nov 19, 2025

Added some more error cases in #8828. Should it error out when both Quantity and Factor are set? Currently it doesn't.

Container-level startup boost has higher priority; I added some test cases to cover that.

Contributor

Great, I just tagged #8828, but it would also be nice to update AEP-7862 with this information.

Contributor Author

@kamarabbas99 kamarabbas99 Nov 21, 2025

Actually, the AEP does mention it.

The new StartupBoost parameter will be added to both:

[VerticalPodAutoscalerSpec]: Will allow users to specify the default CPU startup boost for all containers of the pod targeted by the VPA object.
[ContainerResourcePolicy]: Will allow users to optionally customize the startup boost behavior for individual containers.
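The precedence described in this thread (a container-level setting overrides the spec-level default) can be sketched roughly as below. All type and function names here (`startupBoost`, `containerPolicy`, `vpaSpec`, `effectiveBoost`) are hypothetical simplifications for illustration, not the real VPA API types or the code in this PR.

```go
package main

import "fmt"

// startupBoost is a hypothetical, simplified boost setting.
type startupBoost struct {
	CPUQuantityMilli int64
}

// containerPolicy mimics a per-container policy that may carry its own boost.
type containerPolicy struct {
	ContainerName string
	StartupBoost  *startupBoost
}

// vpaSpec mimics a spec with a pod-wide default boost plus per-container policies.
type vpaSpec struct {
	StartupBoost      *startupBoost
	ContainerPolicies []containerPolicy
}

// effectiveBoost returns the container-level boost when one is set for the
// named container, and falls back to the spec-level default otherwise.
// A nil result means no boost applies.
func effectiveBoost(spec vpaSpec, container string) *startupBoost {
	for _, p := range spec.ContainerPolicies {
		if p.ContainerName == container && p.StartupBoost != nil {
			return p.StartupBoost
		}
	}
	return spec.StartupBoost
}

func main() {
	spec := vpaSpec{
		StartupBoost: &startupBoost{CPUQuantityMilli: 200},
		ContainerPolicies: []containerPolicy{
			{ContainerName: "nginx", StartupBoost: &startupBoost{CPUQuantityMilli: 500}},
			{ContainerName: "sidecar"}, // no override, so the default applies
		},
	}
	for _, name := range []string{"nginx", "sidecar", "other"} {
		fmt.Printf("%s: %dm\n", name, effectiveBoost(spec, name).CPUQuantityMilli)
	}
}
```

A test for the priority rule would then set both levels and assert that the container-level value wins, which is the case the reviewer asked to cover.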

kamarabbas99 added a commit to kamarabbas99/autoscaler that referenced this pull request Nov 19, 2025
@k8s-ci-robot
Contributor

New changes are detected. LGTM label has been removed.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 21, 2025
@alextarasov-spot

Hello, I’m not sure if this is the right place to report this issue, so I’ll share it here. If there’s a more appropriate channel, please let me know.

I cloned the experimental-cpu-boost-v2 branch to test the new feature, and I’m concerned about the following behavior:

  • Boost works for one replica
  • Unboost does not work with a single replica, meaning the pod stays stuck with the boosted CPU request.
1 pods_inplace_restriction.go:112] "Checking if pod can be unboosted" pod="vpa-test/nginx-all-containers-5d649d7c56-t7bqj" durationPassed=true hasAnnotation=true
1 pods_restriction_factory.go:212] "Too few replicas" kind="ReplicaSet" object="vpa-test/nginx-all-containers-5d649d7c56" livePods=1 requiredPods=2 globalMinReplicas=2

The minimum number of replicas required to perform an update is 2 by default:

minReplicas = flag.Int("min-replicas", 2,

This can be confusing, as boosting is allowed with a single replica, but unboosting will never occur in that case.
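The asymmetry reported above can be illustrated with a small sketch. It assumes, based only on the log lines (`livePods=1 requiredPods=2`), that the unboost path goes through the same minimum-replica check as a regular update; the function names and the exemption variant are hypothetical and are not taken from the VPA code base.

```go
package main

import "fmt"

// updateKind distinguishes a regular resource update from removing a startup boost.
type updateKind int

const (
	regularUpdate updateKind = iota
	unboost
)

// allowedByMinReplicas models a gate that treats every update the same way:
// nothing is allowed while fewer than minReplicas pods are alive.
func allowedByMinReplicas(livePods, minReplicas int) bool {
	return livePods >= minReplicas
}

// allowedWithUnboostExemption models one possible fix: unboosting skips the
// minimum-replica gate, while regular updates are still subject to it.
func allowedWithUnboostExemption(kind updateKind, livePods, minReplicas int) bool {
	if kind == unboost {
		return true
	}
	return allowedByMinReplicas(livePods, minReplicas)
}

func main() {
	const livePods, minReplicas = 1, 2 // values from the reported log line

	fmt.Println("uniform gate, unboost allowed:",
		allowedByMinReplicas(livePods, minReplicas))
	fmt.Println("exempting gate, unboost allowed:",
		allowedWithUnboostExemption(unboost, livePods, minReplicas))
	fmt.Println("exempting gate, regular update allowed:",
		allowedWithUnboostExemption(regularUpdate, livePods, minReplicas))
}
```

With a uniform gate, a single-replica workload would stay at the boosted request indefinitely; exempting the unboost path removes that trap without loosening the rule for ordinary updates.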

@kamarabbas99
Contributor Author

@alextarasov-spot thanks for catching this!
Sent #8854 to address this.

@omerap12
Member

@alextarasov-spot, great catch! Thanks for this!

Allow unboost even if pod replicas less than "min-replicas" flag
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kamarabbas99, omerap12
Once this PR has been reviewed and has the lgtm label, please ask for approval from adrianmoisey. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@alextarasov-spot

@kamarabbas99 @omerap12 thank you guys for the quick reaction!

If you don't mind, I will share another issue I found.

Input data

  • resources request
resources:
  requests:
    cpu: "100m"
  • VPA definition
spec:
  resourcePolicy:
    containerPolicies:
    - containerName: nginx
      controlledValues: RequestsAndLimits
  startupBoost:
    cpu:
      duration: 15s
      quantity: 200m
      type: Quantity
  ......
  ......
  updatePolicy:
    updateMode: InPlaceOrRecreate
status:
  recommendation:
    containerRecommendations:
    - containerName: nginx
      target:
        cpu: 105m

Which leads to the following error:

Error creating: Pod "nginx-with-probes-6fb6445758-ldw5m"
  is invalid: spec.containers[0].resources.requests: Invalid value: "305m": must be
  less than or equal to cpu limit of 300m'
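The numbers in this report line up if the Quantity boost is added on top of the recommended target: 105m + 200m = 305m, which is the request value in the error. That reading is an inference from the figures above, not a confirmed description of the implementation, and the 300m limit is simply taken from the error message. A minimal sketch of the arithmetic, using plain millicore integers and hypothetical helper names:

```go
package main

import "fmt"

// boostedRequestMilli assumes a Quantity-type boost is added on top of the
// recommended target. This is inferred from the reported numbers only.
func boostedRequestMilli(targetMilli, boostMilli int64) int64 {
	return targetMilli + boostMilli
}

// checkRequestWithinLimit mirrors the rule behind the reported error:
// a container's CPU request must not exceed its CPU limit.
func checkRequestWithinLimit(requestMilli, limitMilli int64) error {
	if requestMilli > limitMilli {
		return fmt.Errorf("request %dm must be less than or equal to cpu limit of %dm",
			requestMilli, limitMilli)
	}
	return nil
}

func main() {
	const (
		targetMilli = 105 // status.recommendation target from the report
		boostMilli  = 200 // startupBoost.cpu.quantity from the report
		limitMilli  = 300 // limit quoted in the error message
	)
	request := boostedRequestMilli(targetMilli, boostMilli)
	fmt.Printf("boosted request: %dm\n", request)
	if err := checkRequestWithinLimit(request, limitMilli); err != nil {
		fmt.Println("pod would be rejected:", err)
	}
}
```

Any limit calculation that does not start from the same boosted request can end up below it, which is the mismatch the error reports.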

@kamarabbas99
Contributor Author

kamarabbas99 commented Nov 25, 2025

@alextarasov-spot did you manually set the target? Because I thought when controlledValues is RequestsAndLimits, the recommender will set limits as well.

@omerap12
Member

@kamarabbas99 I think it would be helpful to document this use case in the AEP (where no limits are defined for a pod).
Shouldn’t the limit be based on the boosted request when no original limit exists?

@omerap12
Member

Having no limit does make sense :) as long as we document this.

@alextarasov-spot

@kamarabbas99

@alextarasov-spot did you manually set the target? Because I thought when controlledValues is RequestsAndLimits, the recommender will set limits as well.

Yes, I manually configured the VPA target. If RequestsAndLimits is set, the VPA admission controller calculates the limit, but it doesn't factor in the boosted value. I think that when there are no original limits set AND a boost is configured AND controlledValues is RequestsAndLimits, the admission controller should calculate the limit based on the boosted value rather than the original request value.

I think that @omerap12 meant the same:

shouldn’t the limit be based on the boosted request when no original limit exists?

@kamarabbas99
Contributor Author

@alextarasov-spot I am addressing this in #8863, PTAL!

Labels

api-review Categorizes an issue or PR as actively needing an API review. area/vertical-pod-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/feature Categorizes issue or PR as related to a new feature. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.

Projects

Status: Backlog

9 participants