Commit 1461168
feat(gfd): enhance device health check with compute capability probe

Improve GFD's device health checking to catch devices in degraded
states (e.g., after XID errors). Previously, devices might pass the
GetName() check but fail during actual labeling. Now uses two probes:

1. GetName() - catches completely dead devices
2. GetCudaComputeCapability() - catches degraded devices

This prevents scenarios where XID errors leave devices in a state where
basic queries succeed but complex queries fail, which would cause
partial label generation and unnecessary warnings.
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
1 parent 8e66a4a commit 1461168
File tree
7 files changed, +436 -21 lines changed
- cmd/gpu-feature-discovery
- internal
- lm
- resource
- testing