diff --git a/keps/sig-api-machinery/5661-propogate-ownerRefrences/README.md b/keps/sig-api-machinery/5661-propogate-ownerRefrences/README.md new file mode 100644 index 00000000000..3972e64925c --- /dev/null +++ b/keps/sig-api-machinery/5661-propogate-ownerRefrences/README.md @@ -0,0 +1,820 @@ +# KEP-5661: Propagate OwnerReferences from PodTemplateSpec to Pods + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories (Optional)](#user-stories-optional) + - [Story 1](#story-1) + - [Story 2](#story-2) + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) + + +## Release Signoff Checklist + + + +Items marked with (R) are required *prior to targeting to a milestone / release*. 
+ +- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [ ] (R) KEP approvers have approved the KEP status as `implementable` +- [ ] (R) Design details are appropriately documented +- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - [ ] e2e Tests for all Beta API Operations (endpoints) + - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free +- [ ] (R) Graduation criteria is in place + - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) within one minor version of promotion to GA +- [ ] (R) Production readiness review completed +- [ ] (R) Production readiness review approved +- [ ] "Implementation History" section is up-to-date for milestone +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] +- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + + + +[kubernetes.io]: https://kubernetes.io/ +[kubernetes/enhancements]: https://git.k8s.io/enhancements +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes +[kubernetes/website]: https://git.k8s.io/website + +## Summary + +`GetPodFromTemplate` currently discards any `ownerReferences` defined in a `PodTemplateSpec`, replacing them with only the controller’s own `controllerRef`. This prevents workload authors from declaring additional ownership or “belongs-to” relationships between the Pods and other Kubernetes objects. 
+This proposal introduces support for propagating non-controller `ownerReferences` from a `PodTemplateSpec` to the Pods created from it, enabling better visibility and grouping of related resources by controllers and schedulers. The behavior will be gated behind a new feature gate, `PropagateOwnerReferences`. + +## Motivation + +When controllers such as ReplicaSets, StatefulSets, or Jobs create Pods, they automatically assign an `ownerReference` with `controller: true`, establishing a primary control relationship between the controller and its Pods. This ensures correct garbage collection and lifecycle management. +However, the current behavior in `GetPodFromTemplate` discards any `ownerReferences` defined within the `PodTemplateSpec`. This omission prevents workload authors and higher-level systems from expressing additional “belongs-to” or “associated-with” relationships between Pods and other Kubernetes objects. +In many real-world use cases, Pods are not only managed by a single controller but are also conceptually part of broader systems or experiments. For example, a workflow engine, deployment manager, or AI experiment runner might want to associate Pods with higher-level grouping objects for tracking, cleanup, or visibility. Without support for propagating `ownerReferences` (with `controller: false`), these relationships must be maintained out-of-band, leading to fragmented ownership semantics and operational complexity. +Allowing propagation of non-controller `ownerReferences` from a `PodTemplateSpec` to the created Pods closes this gap. It enables declarative, consistent, and introspectable relationships between Pods and other Kubernetes objects, while maintaining the invariant that each Pod has only one controller owner. + +### Goals + +- Enable propagation of non-controller `ownerReferences` defined in a `PodTemplateSpec` to the Pods created from it. 
+- Allow workload authors and higher-level systems to declaratively establish additional ownership or association relationships between Pods and other Kubernetes objects. +- Maintain consistent garbage collection and lifecycle semantics: Pods should still have exactly one controller owner (`controller: true`) but may include additional non-controller owners. +- Ensure this propagation logic is implemented in a controlled, backward-compatible manner within `GetPodFromTemplate` and related controller utilities. +- Provide visibility and traceability across Kubernetes APIs by allowing users and tools to inspect extended ownership relationships directly via the `metadata.ownerReferences` field. + +### Non-Goals + +- Changing or relaxing the rule that a Pod may have only one `ownerReference` with `controller: true`. +- Introducing any new garbage collection behaviors or modifying existing GC rules in Kubernetes. +- Propagating arbitrary metadata or fields other than `ownerReferences` from templates. +- Redefining how higher-level controllers (e.g., ReplicaSet, StatefulSet, Job) determine Pod ownership or reconcile Pods. +- Automatically inferring or generating non-controller owner references; only explicit entries in the `PodTemplateSpec` should be propagated. + +## Proposal + +When controllers create Pods from a `PodTemplateSpec`, they typically invoke `GetPodFromTemplate()` in `pkg/controller/controller_utils.go`. +This helper constructs a new `v1.Pod` object and assigns metadata such as labels, annotations, finalizers, and a `controllerRef` pointing to the parent controller. +Currently, any `ownerReferences` defined within the `PodTemplateSpec` are dropped during this process. + +This proposal modifies `GetPodFromTemplate()` to propagate non-controller `ownerReferences` from the `PodTemplateSpec` to the resulting Pods, while preserving existing controller behavior. 
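A minimal sketch of the propagation described above. It assumes a simplified local `OwnerReference` type in place of `metav1.OwnerReference` so that it compiles without Kubernetes dependencies, and the helper name `mergeOwnerReferences` is hypothetical rather than part of `controller_utils.go`. The `example.com/v1` kinds are placeholders. The sketch only illustrates the intended ordering: the `controllerRef` first, then the template's non-controller references.

```go
package main

import "fmt"

// OwnerReference is a simplified stand-in for metav1.OwnerReference
// (assumption: only the fields needed for this illustration).
type OwnerReference struct {
	APIVersion string
	Kind       string
	Name       string
	UID        string
	Controller *bool
}

// isController reports whether ref is marked controller: true.
func isController(ref OwnerReference) bool {
	return ref.Controller != nil && *ref.Controller
}

// mergeOwnerReferences is a hypothetical helper. It returns controllerRef
// (when non-nil) followed by every template reference that is not marked
// controller: true, so the template can never add a controller owner.
func mergeOwnerReferences(controllerRef *OwnerReference, templateRefs []OwnerReference) []OwnerReference {
	merged := make([]OwnerReference, 0, len(templateRefs)+1)
	if controllerRef != nil {
		merged = append(merged, *controllerRef)
	}
	for _, ref := range templateRefs {
		if isController(ref) {
			continue // never propagate a second controller owner
		}
		merged = append(merged, ref)
	}
	return merged
}

func main() {
	yes, no := true, false
	job := &OwnerReference{APIVersion: "batch/v1", Kind: "Job", Name: "data-job", UID: "xyz-789", Controller: &yes}
	templateRefs := []OwnerReference{
		{APIVersion: "example.com/v1", Kind: "Group", Name: "group-a", UID: "abcd-1234", Controller: &no},
		{APIVersion: "example.com/v1", Kind: "Other", Name: "other-a", UID: "efgh-5678", Controller: &yes},
	}
	for _, ref := range mergeOwnerReferences(job, templateRefs) {
		fmt.Printf("%s/%s controller=%t\n", ref.Kind, ref.Name, isController(ref))
	}
}
```

Running the sketch lists the `Job` reference followed by the `Group` reference; the template entry marked `controller: true` is dropped.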
+ +This enhancement allows workload and extension authors to declaratively define additional ownership relationships for Pods, enabling better integration with higher-level controllers, custom schedulers, and resource management systems without altering existing controller semantics. + +### User Stories (Optional) + + + +#### Story 1 + +#### Story 2 + +### Notes/Constraints/Caveats (Optional) + + + +### Risks and Mitigations + +#### 1. Pods may be garbage collected earlier than expected + +**Risk:** +If a propagated non-controller owner is deleted, its dependent Pods may be garbage collected prematurely. + +**Mitigation:** +This behavior will only be enabled when the `PropagateOwnerReferences` feature gate is set to `true`. Documentation and release notes will clearly describe the ownership propagation semantics so that controllers and users can opt in with full understanding of GC effects. + +--- + +#### 2. Validation errors from multiple `controller: true` references + +**Risk:** +If the `PodTemplateSpec` includes an `ownerReference` marked as `controller: true`, it will conflict with the `controllerRef` that the controller sets, causing Pod creation to fail validation. + +**Mitigation:** +The implementation will drop any template `ownerReference` with `controller: true` to preserve existing validation guarantees. The feature will strictly propagate only non-controller owner references. + +--- + +#### 3. Changes in GC propagation delaying cleanup + +**Risk:** +If Pods now have additional non-controller `ownerReferences`, the garbage collector may delay their deletion until all referenced owners are deleted, potentially prolonging Pod lifecycle. + +**Mitigation:** +The behavior will be documented clearly as part of feature usage guidelines. Because this propagation is purely additive and opt-in, existing controllers will not experience changed GC timing unless they explicitly use this feature. + +--- + +#### 4. Compatibility and rollout risk + +**Risk:** +Unintended behavior could occur in workloads that reuse `PodTemplateSpec`s across controllers or rely on implicit cleanup semantics. + +**Mitigation:** +- The feature is introduced behind a feature gate (`PropagateOwnerReferences`) and will start in **alpha**, allowing cluster operators to safely test and disable it if needed. +- Extensive unit and integration tests will ensure compatibility with garbage collection and `controllerRef` validation logic. + +## Design Details + +### Overview + +The existing `GetPodFromTemplate()` function, located in `pkg/controller/controller_utils.go`, is a shared utility used by multiple workload controllers (e.g., ReplicaSet, StatefulSet, Job) to create `Pod` objects from a `PodTemplateSpec`. +Currently, this helper copies labels, annotations, and finalizers from the template but **discards any existing `OwnerReferences`** defined in the `PodTemplateSpec`. +As a result, only the controller’s own `controllerRef` is attached to the created `Pod`, preventing workload authors from expressing additional ownership relationships. + +### Proposed Change + +When the `PropagateOwnerReferences` feature gate is enabled, `GetPodFromTemplate()` will be modified to **merge non-controller `OwnerReferences`** from the `PodTemplateSpec` into the generated `Pod`. + +The proposed logic is as follows: + +1. Preserve existing behavior for labels, annotations, finalizers, and the controller’s `controllerRef`. +2. From the `template.ObjectMeta.OwnerReferences` list, **filter out any references where `controller: true`**. +3. Append all remaining non-controller `OwnerReferences` to the Pod’s `OwnerReferences` list after adding the `controllerRef` (if provided). +4. If the feature gate is disabled, or no additional owner references exist, the function behaves exactly as before. + +This ensures that: +- The Pod continues to have **exactly one controller** (the parent `controllerRef`). 
+- Additional ownership links (e.g., grouping by a higher-level object or logical hierarchy) can be declared safely. +- Garbage Collection (GC) and validation semantics remain unchanged for existing workloads. + +### Example Behavior + +Given a `PodTemplateSpec` like this (`JobGroup` is a hypothetical kind used only for illustration): + +```yaml +metadata: + ownerReferences: + - apiVersion: "batch/v1" + kind: "JobGroup" + name: "analytics-job-group" + uid: "abcd-1234" + controller: false +``` + +When the controller creates a `Pod` using `GetPodFromTemplate()` with a `controllerRef` pointing to the `Job`, the resulting `Pod` will include both references: + +```yaml +metadata: + ownerReferences: + - apiVersion: "batch/v1" + kind: "Job" + name: "data-job" + uid: "xyz-789" + controller: true + - apiVersion: "batch/v1" + kind: "JobGroup" + name: "analytics-job-group" + uid: "abcd-1234" + controller: false +``` + +This allows tools, schedulers, and GC to recognize both the direct controller relationship and the broader group membership defined by the workload author. + +### Feature Gate + +The change will be gated by a new alpha feature gate, disabled by default: + +```go +PropagateOwnerReferences: { + Default: false, + PreRelease: featuregate.Alpha, +} +``` + +### Test Plan + + + +[ ] I/we understand the owners of the involved components may require updates to +existing tests to make this code solid enough prior to committing the changes necessary +to implement this enhancement. 
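One way the unit tests for this change could be organized is as a table of cases over the gate state and the template contents. The sketch below is illustrative only: `ownerRefsForPod` and its boolean `gateEnabled` parameter are hypothetical stand-ins for `GetPodFromTemplate()` and the `PropagateOwnerReferences` gate lookup, `OwnerReference` is a simplified local type rather than `metav1.OwnerReference`, and `Group` is a placeholder kind.

```go
package main

import "fmt"

// OwnerReference is a simplified stand-in for metav1.OwnerReference.
type OwnerReference struct {
	Kind       string
	Name       string
	Controller *bool
}

// ownerRefsForPod is a hypothetical model of the gated behavior: the
// controllerRef is always kept, and template references are propagated
// only when the gate is enabled and they are not marked controller: true.
func ownerRefsForPod(gateEnabled bool, controllerRef *OwnerReference, templateRefs []OwnerReference) []OwnerReference {
	var refs []OwnerReference
	if controllerRef != nil {
		refs = append(refs, *controllerRef)
	}
	if !gateEnabled {
		return refs // gate off: template references are discarded, as today
	}
	for _, ref := range templateRefs {
		if ref.Controller != nil && *ref.Controller {
			continue
		}
		refs = append(refs, ref)
	}
	return refs
}

// kinds flattens refs to their Kind values for compact comparison.
func kinds(refs []OwnerReference) []string {
	out := make([]string, 0, len(refs))
	for _, ref := range refs {
		out = append(out, ref.Kind)
	}
	return out
}

// equal reports whether two string slices have the same elements in order.
func equal(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

type testCase struct {
	name          string
	gate          bool
	controllerRef *OwnerReference
	templateRefs  []OwnerReference
	want          []string
}

// cases enumerates the scenarios implied by the four steps under "Proposed Change".
func cases() []testCase {
	yes, no := true, false
	job := &OwnerReference{Kind: "Job", Name: "data-job", Controller: &yes}
	group := OwnerReference{Kind: "Group", Name: "group-a", Controller: &no}
	rogue := OwnerReference{Kind: "Rogue", Name: "rogue-a", Controller: &yes}
	return []testCase{
		{"gate disabled drops template refs", false, job, []OwnerReference{group}, []string{"Job"}},
		{"gate enabled propagates non-controller ref", true, job, []OwnerReference{group}, []string{"Job", "Group"}},
		{"gate enabled drops controller ref from template", true, job, []OwnerReference{rogue}, []string{"Job"}},
		{"gate enabled without controllerRef", true, nil, []OwnerReference{group}, []string{"Group"}},
	}
}

func main() {
	for _, tc := range cases() {
		got := kinds(ownerRefsForPod(tc.gate, tc.controllerRef, tc.templateRefs))
		fmt.Printf("%-50s pass=%t got=%v\n", tc.name, equal(got, tc.want), got)
	}
}
```

In a real test these cases would be driven through `GetPodFromTemplate()` with the gate toggled by the feature-gate testing utilities; the table shape is the part intended to carry over.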
+ +##### Prerequisite testing updates + + + +##### Unit tests + + + + + +- ``: `` - `` + +##### Integration tests + + + + + +- [test name](https://github.com/kubernetes/kubernetes/blob/2334b8469e1983c525c0c6382125710093a25883/test/integration/...): [integration master](https://testgrid.k8s.io/sig-release-master-blocking#integration-master?include-filter-by-regex=MyCoolFeature), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=MyCoolFeature) + +##### e2e tests + + + +- [test name](https://github.com/kubernetes/kubernetes/blob/2334b8469e1983c525c0c6382125710093a25883/test/e2e/...): [SIG ...](https://testgrid.k8s.io/sig-...?include-filter-by-regex=MyCoolFeature), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=MyCoolFeature) + +### Graduation Criteria + + + +### Upgrade / Downgrade Strategy + + + +### Version Skew Strategy + + + +## Production Readiness Review Questionnaire + + + +### Feature Enablement and Rollback + + + +###### How can this feature be enabled / disabled in a live cluster? + + + +- [ ] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: + - Components depending on the feature gate: +- [ ] Other + - Describe the mechanism: + - Will enabling / disabling the feature require downtime of the control + plane? + - Will enabling / disabling the feature require downtime or reprovisioning + of a node? + +###### Does enabling the feature change any default behavior? + + + +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? + + + +###### What happens if we reenable the feature if it was previously rolled back? + +###### Are there any tests for feature enablement/disablement? + + + +### Rollout, Upgrade and Rollback Planning + + + +###### How can a rollout or rollback fail? Can it impact already running workloads? + + + +###### What specific metrics should inform a rollback? + + + +###### Were upgrade and rollback tested? 
Was the upgrade->downgrade->upgrade path tested? + + + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + + + +### Monitoring Requirements + + + +###### How can an operator determine if the feature is in use by workloads? + + + +###### How can someone using this feature know that it is working for their instance? + + + +- [ ] Events + - Event Reason: +- [ ] API .status + - Condition name: + - Other field: +- [ ] Other (treat as last resort) + - Details: + +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? + + + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + + + +- [ ] Metrics + - Metric name: + - [Optional] Aggregation method: + - Components exposing the metric: +- [ ] Other (treat as last resort) + - Details: + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? + + + +### Dependencies + + + +###### Does this feature depend on any specific services running in the cluster? + + + +### Scalability + + + +###### Will enabling / using this feature result in any new API calls? + + + +###### Will enabling / using this feature result in introducing new API types? + + + +###### Will enabling / using this feature result in any new calls to the cloud provider? + + + +###### Will enabling / using this feature result in increasing size or count of the existing API objects? + + + +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + + + +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? + + + +###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? 
+ + + +### Troubleshooting + + + +###### How does this feature react if the API server and/or etcd is unavailable? + +###### What are other known failure modes? + + + +###### What steps should be taken if SLOs are not being met to determine the problem? + +## Implementation History + + + +## Drawbacks + + + +## Alternatives + + + +## Infrastructure Needed (Optional) + + diff --git a/keps/sig-api-machinery/5661-propogate-ownerRefrences/kep.yaml b/keps/sig-api-machinery/5661-propogate-ownerRefrences/kep.yaml new file mode 100644 index 00000000000..f898ab7691b --- /dev/null +++ b/keps/sig-api-machinery/5661-propogate-ownerRefrences/kep.yaml @@ -0,0 +1,32 @@ +title: Propagate OwnerReferences from PodTemplateSpec to Pods +kep-number: 5661 +authors: + - "@itzPranshul" +owning-sig: sig-api-machinery +participating-sigs: + - sig-api-machinery + - sig-apps +status: provisional +creation-date: 2025-10-31 +reviewers: + - TBD +approvers: + - TBD + +stage: alpha + +latest-milestone: "v1.36" + +milestone: + alpha: "v1.36" + +# The following PRR answers are required at alpha release +# List the feature gate name and the components for which it must be enabled +feature-gates: + - name: PropagateOwnerReferences + components: + - kube-controller-manager +disable-supported: true + +# The following PRR answers are required at beta release +metrics: []