@@ -181,17 +181,6 @@ any reported failures would be as likely to be the result of a bug in
 the checking code as they would be to indicate a bug in the actual
 syncing code.
 
-```
-<<[UNRESOLVED]>>
-
-We need to decide whether we are going to try to implement such a
-check and associated metric. (For the initial Alpha release, we will
-not; if we decide to add it later that may postpone the transition to
-Beta.)
-
-<<[/UNRESOLVED]>>
-```
-
 ## Design Details
 
 ### Test Plan
@@ -244,18 +233,10 @@ appropriate code for that.
 
 #### Beta
 
-- If we decide that we need to implement the "partial sync gives
-  different result than full sync" metric, then we will not go to Beta
-  until at least one release has passed with that metric available in
-  Alpha.
-
 - No new bugs
 - Feature actually improves performance (e.g., in the `gce-5000Nodes`
   periodic job).
-- Real-world users trying out the Alpha feature have reported both
-  (a) no new bugs, and (b) improved performance.
-- General e2e tests modified to check the "partial sync failures"
-  metric at some point or points during the test run, TBD.
+- Additional metrics to distinguish partial and full resync times
 
 #### GA
 
@@ -280,7 +261,7 @@ ways.
 
 ###### How can this feature be enabled / disabled in a live cluster?
 
-- [ ] Feature gate (also fill in values in `kep.yaml`)
+- [X] Feature gate (also fill in values in `kep.yaml`)
   - Feature gate name: `MinimizeIPTablesRestore`
   - Components depending on the feature gate:
     - kube-proxy
@@ -308,19 +289,23 @@ Nothing different than enabling it the first time.
 
 No, but:
 
-- even with the new code enabled, kube-proxy will always do a full
+- Even with the new code enabled, kube-proxy will always do a full
   sync the first time it syncs after startup (regardless of the
   pre-existing iptables state), so a test that switched from
   "disabled" to "enabled" would not really test anything different
   than a test that just brought up the cluster with the feature
   enabled in the first place.
 
-- even with the new code enabled, kube-proxy will periodically do a
+- Even with the new code enabled, kube-proxy will periodically do a
   full resync, so a test that switched from "enabled" to "disabled"
   would not really test anything that wouldn't also be tested by just
   letting kube-proxy run for a while with the feature enabled. (Where
   "a while" is much less than the length of an e2e run.)
 
+- The new code does not make any API writes (or change any other
+  persistent cluster or node state besides the iptables rules), so
+  there is nothing else that would need to be validated.
+
 ### Rollout, Upgrade and Rollback Planning
 
 ###### How can a rollout or rollback fail? Can it impact already running workloads?
@@ -344,17 +329,19 @@ rules.
 
 ###### What specific metrics should inform a rollback?
 
-We should add metrics to catch the failure modes noted above in [Risks
-and Mitigations](#risks-and-mitigations); in particular we should make
-it noticeable when failed partial resyncs occur.
+The new kube-proxy
+`sync_proxy_rules_iptables_partial_restore_failures_total` metric
+indicates when kube-proxy is falling back from a partial resync to a
+full resync due to an unexpected `iptables-restore` error. While it
+may eventually be non-zero due to unrelated unexpected
+`iptables-restore` errors, it should not ever be steadily increasing;
+if it were, that would likely indicate a bug in the code.
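One way an operator can check this counter is to scrape kube-proxy's metrics endpoint (by default `127.0.0.1:10249` on each node) and extract the value. A minimal sketch, not part of the KEP itself; the sample text below stands in for live `curl` output, and the counter value shown is invented:

```shell
# Sketch: extract the partial-restore failure counter from kube-proxy's
# Prometheus text output. On a node you would typically fetch this with
#   curl -s http://127.0.0.1:10249/metrics
# The sample below (value "3" is invented) stands in for that live
# output so this snippet is self-contained.
metrics='# TYPE sync_proxy_rules_iptables_partial_restore_failures_total counter
sync_proxy_rules_iptables_partial_restore_failures_total 3'

printf '%s\n' "$metrics" |
  awk '/^sync_proxy_rules_iptables_partial_restore_failures_total /{print $2}'
```

A value that keeps climbing between scrapes, rather than settling at some small constant, is the rollback signal described above.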
 
 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
 
-<!--
-Describe manual testing that was done and the outcomes.
-Longer term, we may want to require automated upgrade/rollback tests, but we
-are missing a bunch of machinery and tooling and can't do that now.
--->
+No, but as described above, the ordinary operation of kube-proxy with
+the feature enabled includes behavior that is equivalent to what would
+happen during an upgrade or rollback.
 
 ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
 
@@ -376,33 +363,71 @@ is enabled, then it is in use.
376363
377364###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
378365
379- TBD
380-
381366Sync performance should be much better in large clusters, and should
382- be somewhat independent of cluster size. We will have better answers
383- after it sees more real-world(-ish) use.
384-
385- ` sync_proxy_rules_duration_seconds `
386- ` network_programming_duration_seconds `
387- ` sync_proxy_rules_endpoint_changes_pending `
388- ` sync_proxy_rules_service_changes_pending `
389- ` sync_proxy_rules_last_queued_timestamp_seconds ` vs ` sync_proxy_rules_last_timestamp_seconds `
390-
367+ be somewhat independent of cluster size.
368+
369+ ![ graph of performance over time] ( 5000Node-perf.png )
370+
371+ Enabling this feature on the 5000-node scalability test immediately
372+ resulted in a 2x improvement to the
373+ ` network_programming_duration_seconds ` (aka
374+ ` NetworkProgrammingLatency ` ) metric in the 50th, 90th, and 99th
375+ percentile buckets, with the 50th percentile going from about 27
376+ seconds to about 13 (the first drop in the graph above).
377+
378+ On further investigation, we realized that it could theoretically have
379+ improved further, but the tests were running kube-proxy with
380+ ` --iptables-min-sync-period 10s ` , which introduced its own latency.
381+ Dropping them to ` --iptables-min-sync-period 1s ` (the default) caused
382+ a further drop in the 50th percentile latency, to about 5 seconds (the
383+ second drop in the graph above) though at a cost of increased CPU
384+ usage (since ` syncProxyRules() ` , the heart of kube-proxy, runs 10
385+ times as often now). (90th and 99th percentile latency did not change,
386+ because they are more dominated by the time it takes to do _ full_
387+ resyncs. We realized that we should probably add separate partial/full
388+ sync time metrics to get more fine-grained information.)
389+
390+ For 1.26 we added a new section to the kube-proxy documentation
391+ describing the kube-proxy ` --sync-period ` and
392+ ` --iptables-min-sync-period ` options ([ website #28435 ] ), which also
393+ mentions that the ` MinimizeIPTablesRestore ` feature reduces the need
394+ for ` --iptables-min-sync-period ` and that administrators using that
395+ feature should try out smaller min-sync-period values. For 1.27, we
396+ should add a note about the latency vs CPU usage tradeoff, and we
397+ should make sure to have a release note about this as well.
398+
399+ [ website #28435 ] : https://github.com/kubernetes/website/pull/38435
391400
 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
 
 - [X] Metrics
-  - Metric name: sync_proxy_rules_iptables_partial_restore_failures_total
-  - [Optional] Aggregation method: ??? (I don't know what this means)
+  - Metric names:
+    - sync_proxy_rules_iptables_partial_restore_failures_total
+    - sync_full_proxy_rules_duration_seconds
+    - sync_partial_proxy_rules_duration_seconds
   - Components exposing the metric:
     - kube-proxy
 
 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?
 
-<!--
-Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
-implementation difficulties, etc.).
--->
+As of 1.26, there are no metrics for distinguishing the average
+partial sync time from the average full sync time, which would help
+administrators to understand kube-proxy's performance better and
+figure out a good `--iptables-min-sync-period` value. We will add
+those metrics for beta.
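Once such duration histograms exist, each metric's Prometheus `_sum` and `_count` series give the average sync duration directly. A hypothetical sketch: the metric names assume the beta-planned metrics listed above, and the sample values are invented:

```shell
# Hypothetical sketch: compute average full and partial sync durations
# from a Prometheus histogram's _sum and _count series. The metric
# names assume the planned beta metrics; the values below are invented
# stand-ins for a real kube-proxy metrics scrape.
metrics='sync_full_proxy_rules_duration_seconds_sum 40
sync_full_proxy_rules_duration_seconds_count 4
sync_partial_proxy_rules_duration_seconds_sum 5
sync_partial_proxy_rules_duration_seconds_count 10'

printf '%s\n' "$metrics" | awk '
  /_sum /   { sum[$1] = $2 }                            # total seconds spent syncing
  /_count / { split($1, a, "_count"); cnt[a[1]] = $2 }  # number of syncs
  END {
    for (m in cnt)
      printf "%s avg %.2fs\n", m, sum[m "_sum"] / cnt[m]
  }' | sort
```

With these sample numbers, full resyncs average 10s while partial resyncs average 0.5s, which is the kind of gap that would justify a smaller `--iptables-min-sync-period`.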
+
+As discussed above under [Subtle Synchronization
+Delays](#subtle-synchronization-delays), it would theoretically be
+possible to have kube-proxy check the results of the partial restores
+against an equivalent full restore, to ensure that it got the result
+it was expecting. However, as also discussed there, this would be
+quite difficult to implement, and would involve much more new code
+than the actual partial-restores feature. Given that, and the fact
+that e2e testing during Alpha has not turned up any problems with the
+feature, it seems that if we tried to sanity check the rules in that
+way, a failure would be more likely to indicate a bug in the
+rule-sanity-checking code than a bug in the rule-generation code.
+Thus, we are not implementing such a metric.
 
 ### Dependencies
 
@@ -437,15 +462,17 @@ take by operations related to kube-proxy SLIs/SLOs.
 
 ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
 
-There should be no effect on disk or IO.
+There should be no effect on disk, IO, or RAM.
 
-There should be a major _decrease_ in CPU usage by kube-proxy's
-process group (not by kube-proxy itself, but by the `iptables-restore`
-processes it spawns.)
+The feature itself should result in a _decrease_ in the CPU usage of
+kube-proxy's process group, since the `iptables-restore` processes it
+spawns will have less work to do.
 
-The current PR would not result in any non-negligible increase in
-kube-proxy RAM usage, although some variations of it might involve
-keeping track of more information in memory between resyncs.
+(If administrators also follow the recommendation to lower
+`--iptables-min-sync-period`, then that will _increase_ kube-proxy CPU
+usage, since kube-proxy will be waking up to resync more often. But
+this is a CPU usage increase that the administrator is more-or-less
+explicitly requesting.)
 
 ### Troubleshooting
 
@@ -455,28 +482,19 @@ It does not change kube-proxy's behavior in this case.
 
 ###### What are other known failure modes?
 
-(No _known_ failure modes yet. See [Risks and
+No known failure modes. See [Risks and
 Mitigations](#risks-and-mitigations) for theorized potential failure
-modes.)
-
-<!--
-For each of them, fill in the following information by copying the below template:
-- [Failure mode brief description]
-  - Detection: How can it be detected via metrics? Stated another way:
-    how can an operator troubleshoot without logging into a master or worker node?
-  - Mitigations: What can be done to stop the bleeding, especially for already
-    running user workloads?
-  - Diagnostics: What are the useful log messages and their required logging
-    levels that could help debug the issue?
-    Not required until feature graduated to beta.
-  - Testing: Are there any tests for failure mode? If not, describe why.
--->
+modes.
 
 ###### What steps should be taken if SLOs are not being met to determine the problem?
 
 ## Implementation History
 
 - 2022-08-02: Initial proposal
+- 2022-10-04: Initial KEP merged
+- 2022-11-01: Code PR merged
+- 2022-11-10: Testing PRs merged; 5000Nodes test begins enabling the feature
+- 2022-12-08: Kubernetes 1.26 released
 
 ## Drawbacks
 