Commit 6965f91
drm/xe/guc_submit: fix race around pending_disable
Currently in some testcases we can trigger:
[drm] *ERROR* GT0: SCHED_DONE: Unexpected engine state 0x02b1, guc_id=8, runnable_state=0
[drm] *ERROR* GT0: G2H action 0x1002 failed (-EPROTO) len 3 msg 02 10 00 90 08 00 00 00 00 00 00 00
Looking at a snippet of corresponding ftrace for this GuC id we can see:
498.852891: xe_sched_msg_add: dev=0000:03:00.0, gt=0 guc_id=8, opcode=3
498.854083: xe_sched_msg_recv: dev=0000:03:00.0, gt=0 guc_id=8, opcode=3
498.855389: xe_exec_queue_kill: dev=0000:03:00.0, 5:0x1, gt=0, width=1, guc_id=8, guc_state=0x3, flags=0x0
498.855436: xe_exec_queue_lr_cleanup: dev=0000:03:00.0, 5:0x1, gt=0, width=1, guc_id=8, guc_state=0x83, flags=0x0
498.856767: xe_exec_queue_close: dev=0000:03:00.0, 5:0x1, gt=0, width=1, guc_id=8, guc_state=0x83, flags=0x0
498.862889: xe_exec_queue_scheduling_disable: dev=0000:03:00.0, 5:0x1, gt=0, width=1, guc_id=8, guc_state=0xa9, flags=0x0
498.863032: xe_exec_queue_scheduling_disable: dev=0000:03:00.0, 5:0x1, gt=0, width=1, guc_id=8, guc_state=0x2b9, flags=0x0
498.875596: xe_exec_queue_scheduling_done: dev=0000:03:00.0, 5:0x1, gt=0, width=1, guc_id=8, guc_state=0x2b9, flags=0x0
498.875604: xe_exec_queue_deregister: dev=0000:03:00.0, 5:0x1, gt=0, width=1, guc_id=8, guc_state=0x2b1, flags=0x0
499.074483: xe_exec_queue_deregister_done: dev=0000:03:00.0, 5:0x1, gt=0, width=1, guc_id=8, guc_state=0x2b1, flags=0x0
This looks to be the two scheduling_disable racing with each other, one
from the suspend (opcode=3) and then again during lr cleanup. While
those two operations are serialized, the G2H portion is not, therefore
when marking the queue as pending_disabled and then firing off the first
request, we proceed do the same again, however the first disable
response only fires after this which then clears the pending_disabled.
At this point the second comes back and is processed, however the
pending_disabled is no longer set, hence triggering the warning.
To fix this wait for pending_disabled when doing the lr cleanup and
calling disable_scheduling_deregister. Also do the same for all other
disable_scheduling callers.
Fixes: dd08ebf ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3515
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: <stable@vger.kernel.org> # v6.8+
Reviewed-by: Matthew Brost <mattheq.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241122161914.321263-5-matthew.auld@intel.com
(cherry picked from commit ddb106d)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>1 parent ece4502 commit 6965f91
1 file changed
+10
-7
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
769 | 769 | | |
770 | 770 | | |
771 | 771 | | |
772 | | - | |
773 | 772 | | |
774 | 773 | | |
775 | 774 | | |
776 | 775 | | |
777 | | - | |
778 | | - | |
| 776 | + | |
| 777 | + | |
| 778 | + | |
| 779 | + | |
| 780 | + | |
779 | 781 | | |
780 | 782 | | |
781 | 783 | | |
782 | | - | |
| 784 | + | |
783 | 785 | | |
784 | 786 | | |
785 | 787 | | |
| |||
1101 | 1103 | | |
1102 | 1104 | | |
1103 | 1105 | | |
1104 | | - | |
| 1106 | + | |
| 1107 | + | |
1105 | 1108 | | |
1106 | 1109 | | |
1107 | 1110 | | |
| |||
1330 | 1333 | | |
1331 | 1334 | | |
1332 | 1335 | | |
1333 | | - | |
1334 | | - | |
| 1336 | + | |
| 1337 | + | |
1335 | 1338 | | |
1336 | 1339 | | |
1337 | 1340 | | |
| |||
0 commit comments