Commit 0618c46

sched/fair: Bump sd->max_newidle_lb_cost when newidle balance fails
JIRA: https://issues.redhat.com/browse/RHEL-110301

commit 155213a
Author: Chris Mason <clm@fb.com>
Date:   Thu Jun 26 07:39:10 2025 -0700

    sched/fair: Bump sd->max_newidle_lb_cost when newidle balance fails

    schbench (https://github.com/masoncl/schbench.git) is showing a
    regression from previous production kernels that bisected down to:

        sched/fair: Remove sysctl_sched_migration_cost condition (c5b0a7e)

    The schbench command line was:

        schbench -L -m 4 -M auto -t 256 -n 0 -r 0 -s 0

    This creates 4 message threads pinned to CPUs 0-3, and 256x4 worker
    threads spread across the rest of the CPUs. Neither the worker threads
    nor the message threads do any work; they just wake each other up and
    go back to sleep as soon as possible. The end result is that the first
    4 CPUs are pegged waking up those 1024 workers, and the rest of the
    CPUs are constantly banging in and out of idle.

    If I take a v6.9 Linus kernel and revert that one commit, performance
    goes from 3.4M RPS to 5.4M RPS. schedstat shows there are ~100x more
    new idle balance operations, and profiling shows the worker threads
    are spending ~20% of their CPU time on new idle balance. schedstats
    also shows that almost all of these new idle balance attempts are
    failing to find busy groups.

    The fix used here is to crank up the cost of the newidle balance
    whenever it fails. Since we don't want sd->max_newidle_lb_cost to grow
    out of control, this also changes update_newidle_cost() to use
    sysctl_sched_migration_cost as the upper limit on max_newidle_lb_cost.

    Signed-off-by: Chris Mason <clm@fb.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lkml.kernel.org/r/20250626144017.1510594-2-clm@fb.com

Signed-off-by: Phil Auld <pauld@redhat.com>
1 parent 2e97e2d commit 0618c46

File tree

1 file changed: +16 −3 lines changed


kernel/sched/fair.c

Lines changed: 16 additions & 3 deletions

@@ -12052,8 +12052,14 @@ static inline bool update_newidle_cost(struct sched_domain *sd, u64 cost)
 		/*
 		 * Track max cost of a domain to make sure to not delay the
 		 * next wakeup on the CPU.
+		 *
+		 * sched_balance_newidle() bumps the cost whenever newidle
+		 * balance fails, and we don't want things to grow out of
+		 * control. Use the sysctl_sched_migration_cost as the upper
+		 * limit, plus a little extra to avoid off by ones.
 		 */
-		sd->max_newidle_lb_cost = cost;
+		sd->max_newidle_lb_cost =
+			min(cost, sysctl_sched_migration_cost + 200);
 		sd->last_decay_max_lb_cost = jiffies;
 	} else if (time_after(jiffies, sd->last_decay_max_lb_cost + HZ)) {
 		/*
@@ -12745,10 +12751,17 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
 
 			t1 = sched_clock_cpu(this_cpu);
 			domain_cost = t1 - t0;
-			update_newidle_cost(sd, domain_cost);
-
 			curr_cost += domain_cost;
 			t0 = t1;
+
+			/*
+			 * Failing newidle means it is not effective;
+			 * bump the cost so we end up doing less of it.
+			 */
+			if (!pulled_task)
+				domain_cost = (3 * sd->max_newidle_lb_cost) / 2;
+
+			update_newidle_cost(sd, domain_cost);
 		}
 
 		/*
