Skip to content

Commit f2cdf89

Browse files
committed
powerpc/watchdog: Fix missed watchdog reset due to memory ordering race
jira LE-1907 Rebuild_History Non-Buildable kernel-rt-5.14.0-284.30.1.rt14.315.el9_2 commit-author Nicholas Piggin <npiggin@gmail.com> commit 5dad4ba It is possible for all CPUs to miss the pending cpumask becoming clear, and then nobody resetting it, which will cause the lockup detector to stop working. It will eventually expire, but watchdog_smp_panic will avoid doing anything if the pending mask is clear and it will never be reset. Order the cpumask clear vs the subsequent test to close this race. Add an extra check for an empty pending mask when the watchdog fires and finds its bit still clear, to try to catch any other possible races or bugs here and keep the watchdog working. The extra test in arch_touch_nmi_watchdog is required to prevent the new warning from firing off. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com> Debugged-by: Laurent Dufour <ldufour@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211110025056.2084347-2-npiggin@gmail.com (cherry picked from commit 5dad4ba) Signed-off-by: Jonathan Maple <jmaple@ciq.com>
1 parent 1a31244 commit f2cdf89

File tree

1 file changed

+40
-1
lines changed

1 file changed

+40
-1
lines changed

arch/powerpc/kernel/watchdog.c

Lines changed: 40 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -139,6 +139,10 @@ static void set_cpumask_stuck(const struct cpumask *cpumask, u64 tb)
139139
{
140140
cpumask_or(&wd_smp_cpus_stuck, &wd_smp_cpus_stuck, cpumask);
141141
cpumask_andnot(&wd_smp_cpus_pending, &wd_smp_cpus_pending, cpumask);
142+
/*
143+
* See wd_smp_clear_cpu_pending()
144+
*/
145+
smp_mb();
142146
if (cpumask_empty(&wd_smp_cpus_pending)) {
143147
wd_smp_last_reset_tb = tb;
144148
cpumask_andnot(&wd_smp_cpus_pending,
@@ -225,13 +229,44 @@ static void wd_smp_clear_cpu_pending(int cpu, u64 tb)
225229

226230
cpumask_clear_cpu(cpu, &wd_smp_cpus_stuck);
227231
wd_smp_unlock(&flags);
232+
} else {
233+
/*
234+
* The last CPU to clear pending should have reset the
235+
* watchdog so we generally should not find it empty
236+
* here if our CPU was clear. However it could happen
237+
* due to a rare race with another CPU taking the
238+
* last CPU out of the mask concurrently.
239+
*
240+
* We can't add a warning for it. But just in case
241+
* there is a problem with the watchdog that is causing
242+
* the mask to not be reset, try to kick it along here.
243+
*/
244+
if (unlikely(cpumask_empty(&wd_smp_cpus_pending)))
245+
goto none_pending;
228246
}
229247
return;
230248
}
249+
231250
cpumask_clear_cpu(cpu, &wd_smp_cpus_pending);
251+
252+
/*
253+
* Order the store to clear pending with the load(s) to check all
254+
* words in the pending mask to check they are all empty. This orders
255+
* with the same barrier on another CPU. This prevents two CPUs
256+
* clearing the last 2 pending bits, but neither seeing the other's
257+
* store when checking if the mask is empty, and missing an empty
258+
* mask, which ends with a false positive.
259+
*/
260+
smp_mb();
232261
if (cpumask_empty(&wd_smp_cpus_pending)) {
233262
unsigned long flags;
234263

264+
none_pending:
265+
/*
266+
* Double check under lock because more than one CPU could see
267+
* a clear mask with the lockless check after clearing their
268+
* pending bits.
269+
*/
235270
wd_smp_lock(&flags);
236271
if (cpumask_empty(&wd_smp_cpus_pending)) {
237272
wd_smp_last_reset_tb = tb;
@@ -322,8 +357,12 @@ void arch_touch_nmi_watchdog(void)
322357
{
323358
unsigned long ticks = tb_ticks_per_usec * wd_timer_period_ms * 1000;
324359
int cpu = smp_processor_id();
325-
u64 tb = get_tb();
360+
u64 tb;
326361

362+
if (!cpumask_test_cpu(cpu, &watchdog_cpumask))
363+
return;
364+
365+
tb = get_tb();
327366
if (tb - per_cpu(wd_timer_tb, cpu) >= ticks) {
328367
per_cpu(wd_timer_tb, cpu) = tb;
329368
wd_smp_clear_cpu_pending(cpu, tb);

0 commit comments

Comments
 (0)