Conversation

@alcampag (Member)

What this PR does / why we need it:
This PR introduces a new field on OCIManagedCluster that allows the network to be created in a different compartment than the one specified in OCIManagedCluster.Spec.CompartmentId.

The new field is called OCIManagedCluster.Spec.NetworkSpec.CompartmentId. It is optional, and if not set it defaults to the compartment specified in OCIManagedCluster.Spec.CompartmentId.
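
A minimal sketch of the field addition, assuming the usual Cluster API conventions for optional fields (the struct layout is inferred from the PR description, not copied from the diff):

// NetworkSpec sketch; only the new field is shown, the rest of the struct is unchanged.
type NetworkSpec struct {
    // CompartmentId is the OCID of the compartment in which the network
    // resources are created. If empty, it defaults to
    // OCIManagedCluster.Spec.CompartmentId.
    // +optional
    CompartmentId string `json:"compartmentId,omitempty"`

    // ... existing NetworkSpec fields
}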

In addition, this PR fixes several bugs in the managed machine pool controller. The controller logic caused continuous updates to the node pool if the original OCIManagedCluster was modified, because not all fields were populated from the actual managed node pool state and updated correctly.

The new reconciliation logic follows this design (a sketch of the rule follows the list):

  • If the user has defined a value for a field in the OCIManagedCluster object, the controller will try to synchronize it
  • If the field is not specified at all, the controller will keep it as defined in the actual state of the node pool
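
A minimal sketch of that rule for a single optional field (the helper name and shape are illustrative, not the PR's actual code):

// reconcileOptionalField applies the design above to one optional field:
// a value set in the spec wins; an unset field inherits the actual node
// pool state so the controller does not fight the service defaults.
func reconcileOptionalField(spec, actual *string) *string {
    if spec != nil {
        return spec // user-defined: synchronize toward the spec
    }
    return actual // unspecified: keep the actual node pool value
}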

Which issue(s) this PR fixes:
Fixes #450

oracle-contributor-agreement bot added the OCA Verified (All contributors have signed the Oracle Contributor Agreement.) label Sep 26, 2025
joekr added the bug (Something isn't working) and enhancement (New feature or request) labels Sep 26, 2025
@joekr (Member) commented Sep 29, 2025

What happens if a user 1) updates the network compartment ID to a new compartment, or 2) deletes the network compartment ID?

@joekr (Member) commented Sep 29, 2025

Instead of calling out all the places tests need to be added I'll just make the blanket statement here. We need to make sure we have some unit test coverage for this update. I know we talked about adding e2e tests around this already as well, but I'm just putting this in the comment for posterity.

@alcampag (Member, Author)

I will need to think how to handle the networkCompartmentId update/deletion.

@joekr (Member) commented Sep 30, 2025

I will need to think how to handle the networkCompartmentId update/deletion.

We talked about this and we came to the conclusion that update/delete of the compartmentID won't work as it is set up now. We won't address this as part of this PR. At some point in the future we will introduce this, but it will need to leverage https://docs.oracle.com/en-us/iaas/api/#/en/iaas/latest/Vcn/ChangeVcnCompartment. This will be a bigger lift than just adding it as part of this PR.
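
For reference, a hedged sketch of what that future move could look like with the OCI Go SDK (request and field names follow the linked API reference; treat the exact wiring as an assumption, it is not part of this PR):

import (
    "context"

    "github.com/oracle/oci-go-sdk/v65/core"
)

// moveVcnCompartment sketches the future ChangeVcnCompartment call referred
// to above; not part of this PR.
func moveVcnCompartment(ctx context.Context, client core.VirtualNetworkClient, vcnId, targetCompartmentId string) error {
    _, err := client.ChangeVcnCompartment(ctx, core.ChangeVcnCompartmentRequest{
        VcnId: &vcnId,
        ChangeVcnCompartmentDetails: core.ChangeVcnCompartmentDetails{
            CompartmentId: &targetCompartmentId,
        },
    })
    return err
}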

alcampag and others added 5 commits October 2, 2025 14:37
- Introduce new cluster template for OCIManagedCluster with separate network compartment.
- Update e2e_conf.yaml to include the template and OCI_NETWORK_COMPARTMENT_ID variable.
- Modify Makefile to generate the new template (replacing node-recycling build).
- Add --load flag to docker-build for local image loading.
- Create cluster.yaml defining the managed cluster and control plane specs.
alcampag marked this pull request as ready for review October 16, 2025 07:30
alcampag requested a review from joekr October 16, 2025 07:30
@alcampag (Member, Author)

@joekr I have added the e2e template and merged with the latest release. If you see a test failing locally, it's because @vladcristi has implemented a new unit test for instance principals, which will fail in a local environment.
The only thing missing is the e2e test for NetworkCompartmentId, but I have already done my share of manual testing with different configurations.

$(KUSTOMIZE) build $(OCI_TEMPLATES)/v1beta2/cluster-template-externally-managed-vcn --load-restrictor LoadRestrictionsNone > $(OCI_TEMPLATES)/v1beta2/cluster-template-externally-managed-vcn.yaml
$(KUSTOMIZE) build $(OCI_TEMPLATES)/v1beta2/cluster-template-machine-pool --load-restrictor LoadRestrictionsNone > $(OCI_TEMPLATES)/v1beta2/cluster-template-machine-pool.yaml
$(KUSTOMIZE) build $(OCI_TEMPLATES)/v1beta2/cluster-template-managed --load-restrictor LoadRestrictionsNone > $(OCI_TEMPLATES)/v1beta2/cluster-template-managed.yaml
$(KUSTOMIZE) build $(OCI_TEMPLATES)/v1beta2/cluster-template-managed-node-recycling --load-restrictor LoadRestrictionsNone > $(OCI_TEMPLATES)/v1beta2/cluster-template-managed-node-recycling.yaml
Member:

Should this be deleted?

    if err == nil && resp.Subnet.DisplayName != nil {
        return *resp.Subnet.DisplayName
    }
}
Member:

I don't think we want to add API calls into the util methods. 🤔

I think most things in this layer are really treating things as in-memory calls. What if we added these API calls to the respective controller? Would that solve the same issues being solved here?

spec:
  compartmentId: "${OCI_COMPARTMENT_ID}"
  networkSpec:
    compartmentId: "${OCI_NETWORK_COMPARTMENT_ID}"
Member:

NIT: do you mind adding a simple comment that calls out this line since this is the major change that we are testing?

if err := r.reconcileComponent(ctx, ociManagedCluster, clusterScope.ReconcileVCN, "VCN",
    infrastructurev1beta2.VcnReconciliationFailedReason, infrastructurev1beta2.VcnEventReady); err != nil {
if err := r.reconcileComponent(ctx, ociManagedCluster, clusterScope.ReconcileDRG, "DRG",
    infrastructurev1beta2.DrgReconciliationFailedReason, infrastructurev1beta2.DrgEventReady); err != nil {
Member:

What is the purpose of reordering this?

}

func (m *ManagedMachinePoolScope) getPodSubnets(subnets []string) []string {
func (m *ManagedMachinePoolScope) getNetworkCompartmentId() string {
Member:

Do you mind adding a method comment? Something like:

// getNetworkCompartmentId returns the compartmentId of the NetworkSpec if set, otherwise defaults to the OCIManagedCluster compartmentId
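
For context, a plausible shape of that helper with the suggested comment applied (the field paths are inferred from the PR description, not copied from the diff):

// getNetworkCompartmentId returns the compartmentId of the NetworkSpec if set,
// otherwise it defaults to the OCIManagedCluster compartmentId.
func (m *ManagedMachinePoolScope) getNetworkCompartmentId() string {
    if networkCompartmentId := m.OCIManagedCluster.Spec.NetworkSpec.CompartmentId; networkCompartmentId != "" {
        return networkCompartmentId
    }
    return m.OCIManagedCluster.Spec.CompartmentId
}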

        }
        subnet.ID = &subnetId
    }
    subnetList = append(subnetList, *subnet.ID)
Member:

I'm not 100% sure what we want to do here. Originally we were pulling the spec pod subnets, which I believe get updated. So wouldn't the subnets be up to date here when reconciled?

If I'm reading this correctly, when the subnet ID in the spec is nil this will make API calls to get the subnet info. I understand that, but I'm trying to understand why the subnet ID would be nil even after the reconcile. I'm probably not understanding something.

            subnetList = podDetails.PodSubnetIds
        }
    }
    return subnetList, nil
Member:

If no subnets found and we have an existing node pool

I don't understand how this is possible. Without a subnet we couldn't have a node pool, correct? Again, probably something I'm not understanding.

if len(nsgList) == 0 && pool != nil && pool.NodeConfigDetails != nil {
    if podDetails, ok := pool.NodeConfigDetails.NodePoolPodNetworkOptionDetails.(oke.OciVcnIpNativeNodePoolPodNetworkOptionDetails); ok {
        nsgList = podDetails.PodNsgIds
    }
Member:

If no NSGs found and we have an existing node pool, use the NSGs from the node pool

We will want to make sure this change is documented for the users.

// The main reconciliation loop for the node pool is triggered by reflect.DeepEqual(spec, actual). While it's correct to update a node pool if the spec has changed,
// this can lead to an infinite update loop for some parameters that are not specified in the spec but are returned by the actual state of the node pool.
// To solve this, the spec is updated before the reflect.DeepEqual(spec, actual) check, but only for those problematic parameters.
func setSpecFromActual(spec *infrav2exp.OCIManagedMachinePoolSpec, actual *expinfra1.OCIManagedMachinePoolSpec) {
Member:

This is going to become a nightmare to keep up to date if there are any changes to machinepool. There has to be a better way to do this. This seems very brittle.

Member:

This is mutating the spec at the very least. I wonder if we should at least make a copy instead of mutating the spec 🤔

Member:

What about instead computing a minimal “desired patch” (only the fields the user specified) and triggering UpdateNodePool only if that patch has at least one field set?

I would have put this comment up in the UpdateNodePool method, but since the context is all here I thought I would put it here.
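
A minimal sketch of that idea, with a hypothetical patch type (the type and field names are illustrative, not the node pool API's):

// desiredNodePoolPatch carries only the fields the user actually set in the
// spec; everything left nil is simply not sent to the service.
type desiredNodePoolPatch struct {
    KubernetesVersion *string
    NodeShape         *string
}

// isEmpty reports whether the user specified nothing to change, in which
// case UpdateNodePool can be skipped entirely, avoiding the
// reflect.DeepEqual-driven update loop described above.
func (p desiredNodePoolPatch) isEmpty() bool {
    return p.KubernetesVersion == nil && p.NodeShape == nil
}

UpdateNodePool would then run only when the patch is non-empty, instead of whenever reflect.DeepEqual(spec, actual) reports a difference.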

s.OCIClusterAccessor.GetNetworkSpec().Vcn.ID = vcn.Id
// Someone has updated the network compartment ID, or the VCN was moved. Return an error.
if vcn.CompartmentId != nil && *vcn.CompartmentId != s.GetNetworkCompartmentId() {
    return errors.New("CompartmentId of the VCN is not the same as the one specified in the spec")
Member:

I think we should probably try to catch this at the webhook layer. We can also leave this check, but if the user tries to update the field, the webhook would catch it before the change is even applied. Might be a better UX.
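
A sketch of what that webhook check might look like, assuming controller-runtime's validating-webhook pattern (the exact ValidateUpdate signature varies across controller-runtime versions, so treat this as illustrative):

import (
    "errors"

    "k8s.io/apimachinery/pkg/runtime"
)

// ValidateUpdate rejects changes to the network compartment, surfacing the
// unsupported move to the user before the controller ever sees it.
func (c *OCIManagedCluster) ValidateUpdate(old runtime.Object) error {
    oldCluster, ok := old.(*OCIManagedCluster)
    if !ok {
        return errors.New("expected an OCIManagedCluster")
    }
    if oldCluster.Spec.NetworkSpec.CompartmentId != c.Spec.NetworkSpec.CompartmentId {
        return errors.New("networkSpec.compartmentId is immutable; moving a VCN requires the ChangeVcnCompartment API")
    }
    return nil
}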


Labels

bug: Something isn't working
enhancement: New feature or request
OCA Verified: All contributors have signed the Oracle Contributor Agreement.

Development

Successfully merging this pull request may close these issues.

Possibility to provision OCI Managed Control Plane and network on different compartments

2 participants