
Commit 35edb64

module: avoid allocation if module is already present and ready
1 parent bc1171a commit 35edb64

Lines changed: 330 additions & 0 deletions
@@ -0,0 +1,330 @@
module: avoid allocation if module is already present and ready

jira LE-1907
Rebuild_History Non-Buildable kernel-5.14.0-427.40.1.el9_4
commit-author Luis Chamberlain <mcgrof@kernel.org>
commit 064f4536d13939b6e8cdb71298ff5d657f4f8caa
Empty-Commit: Cherry-Pick Conflicts during history rebuild.
Will be included in final tarball splat. Ref for failed cherry-pick at:
ciq/ciq_backports/kernel-5.14.0-427.40.1.el9_4/064f4536.failed

The finit_module() system call can create unnecessary virtual memory
pressure for duplicate modules. This is because load_module() can in
the worst case allocate more than twice the size of a module in virtual
memory. This saves at least a full module's worth of wasted vmalloc
space by trying to avoid duplicates as soon as we can validate the
module name in the read module structure.

This can only be an issue if a system is getting hammered with userspace
loading modules. There are typically two ways to load modules on systems:
one is the kernel module auto-loading (*request_module*() calls in-kernel)
and the other is things like udev. The auto-loading is in-kernel, but it
pings back to userspace to just call modprobe. We already have a way to
restrict the amount of concurrent kernel auto-loads in a given time; however,
that still allows multiple requests for the same module to go through
and force two threads in userspace to race to call modprobe for the same
exact module. Even though libkmod, which both modprobe and udev use, does
check if a module is already loaded prior to calling finit_module(), races
are still possible, and this is clearly evident today when you have multiple
CPUs.

To avoid memory pressure for such stupid cases, put a stop gap in place for
them. The *earliest* we can detect duplicates from the module side of things
is once we have blessed the module name, sadly after the first vmalloc
allocation. We can check for the module being present *before* a secondary
vmalloc() allocation.

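The heart of that check, shown here as a minimal sketch (the full hunk, where
this lookup sits behind the existing blacklist, section-header and modinfo
checks, is in the conflict diff at the end of this file), is simply a lookup
of the module name under module_mutex before any further allocation happens:

	/* info->mod still points into the temporary ELF copy; nothing else is allocated yet */
	mutex_lock(&module_mutex);
	err = module_patient_check_exists(info->mod->name, FAIL_DUP_MOD_BECOMING);
	mutex_unlock(&module_mutex);
	if (err)
		return err;	/* a duplicate is live or in flight: skip the second vmalloc() */
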
There is a linear relationship between wasted virtual memory bytes and
the number of CPUs. The reason is that udev ends up racing to call
tons of the same modules for each of the CPUs.

We can see the different linear relationships between wasted virtual
memory and CPU count after boot in the following graph:

+----------------------------------------------------------------------------+
14GB |-+ + + + + *+ +-|
| **** |
| *** |
| ** |
12GB |-+ ** +-|
| ** |
| ** |
| ** |
| ** |
10GB |-+ ** +-|
| ** |
| ** |
| ** |
8GB |-+ ** +-|
waste | ** ### |
| ** #### |
| ** ####### |
6GB |-+ **** #### +-|
| * #### |
| * #### |
| ***** #### |
4GB |-+ ** #### +-|
| ** #### |
| ** #### |
| ** #### |
2GB |-+ ** ##### +-|
| * #### |
| * #### Before ******* |
| **## + + + + After ####### |
+----------------------------------------------------------------------------+
0 50 100 150 200 250 300
CPUs count

On the y-axis we can see gigabytes of wasted virtual memory during boot
due to duplicate module requests which just end up failing. Trying to
infer the slope, this ends up being about ~463 MiB per CPU lost prior
to this patch. After this patch we only lose about ~230 MiB per CPU, for
a total savings of about ~233 MiB per CPU. This is all *just on bootup*!

On an 8vcpu 8 GiB RAM system using kdevops and testing against selftests
kmod.sh -t 0008 I see a saving on the *highest* side of memory
consumption of up to ~84 MiB with the Linux kernel selftests kmod
test 0008. With the new stress-ng module test I see a 145 MiB difference
in max memory consumption with 100 ops. The stress-ng module ops test can be
pretty pathological -- it is not realistic, however it was used to
finally successfully reproduce issues which are only reported to happen on
systems with over 400 CPUs [0] by just using 100 ops on an 8vcpu 8 GiB RAM
system. Running out of virtual memory space is no surprise given the
above graph, since at least on x86_64 we're capped at 128 MiB; eventually
we'd hit a series of errors, and one can use the above graph to
guesstimate when. This of course will vary depending on the features
you have enabled. So for instance, enabling KASAN seems to make this
much worse.

The results with kmod and stress-ng can be observed and visualized below.
The time it takes to run the test is also not affected.

The kmod test 0008:

The gnuplot yrange is set from 400000 KiB (390 MiB) to 580000 KiB (566 MiB)
given the tests peak around that range.

cat kmod.plot
set term dumb
set output fileout
set yrange [400000:580000]
plot filein with linespoints title "Memory usage (KiB)"

Before:
root@kmod ~ # /data/linux-next/tools/testing/selftests/kmod/kmod.sh -t 0008
root@kmod ~ # free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > log-0008-before.txt ^C
root@kmod ~ # sort -n -r log-0008-before.txt | head -1
528732

So ~516.33 MiB

After:

root@kmod ~ # /data/linux-next/tools/testing/selftests/kmod/kmod.sh -t 0008
root@kmod ~ # free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > log-0008-after.txt ^C

root@kmod ~ # sort -n -r log-0008-after.txt | head -1
442516

So ~432.14 MiB

That's about ~84 MiB in savings in the worst case. The graphs:

root@kmod ~ # gnuplot -e "filein='log-0008-before.txt'; fileout='graph-0008-before.txt'" kmod.plot
root@kmod ~ # gnuplot -e "filein='log-0008-after.txt'; fileout='graph-0008-after.txt'" kmod.plot

root@kmod ~ # cat graph-0008-before.txt

580000 +-----------------------------------------------------------------+
| + + + + + + + |
560000 |-+ Memory usage (KiB) ***A***-|
| |
540000 |-+ +-|
| |
| *A *AA*AA*A*AA *A*AA A*A*A *AA*A*AA*A A |
520000 |-+A*A*AA *AA*A *A*AA*A*AA *A*A A *A+-|
|*A |
500000 |-+ +-|
| |
480000 |-+ +-|
| |
460000 |-+ +-|
| |
| |
440000 |-+ +-|
| |
420000 |-+ +-|
| + + + + + + + |
400000 +-----------------------------------------------------------------+
0 5 10 15 20 25 30 35 40

root@kmod ~ # cat graph-0008-after.txt

580000 +-----------------------------------------------------------------+
| + + + + + + + |
560000 |-+ Memory usage (KiB) ***A***-|
| |
540000 |-+ +-|
| |
| |
520000 |-+ +-|
| |
500000 |-+ +-|
| |
480000 |-+ +-|
| |
460000 |-+ +-|
| |
| *A *A*A |
440000 |-+A*A*AA*A A A*A*AA A*A*AA*A*AA*A*AA*A*AA*AA*A*AA*A*AA-|
|*A *A*AA*A |
420000 |-+ +-|
| + + + + + + + |
400000 +-----------------------------------------------------------------+
0 5 10 15 20 25 30 35 40

The stress-ng module tests:

This is used to run the test to try to reproduce the vmap issues
reported by David:

echo 0 > /proc/sys/vm/oom_dump_tasks
./stress-ng --module 100 --module-name xfs

Prior to this commit:
root@kmod ~ # free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > baseline-stress-ng.txt
root@kmod ~ # sort -n -r baseline-stress-ng.txt | head -1
5046456

After this commit:
root@kmod ~ # free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > after-stress-ng.txt
root@kmod ~ # sort -n -r after-stress-ng.txt | head -1
4896972

5046456 - 4896972
149484
149484/1024
145.98046875000000000000

So with stress-ng this commit reveals a saving of about 145 MiB in memory
using 100 ops, which reproduced the vmap issue reported.

cat kmod-simple-stress-ng.plot
set term dumb
set output fileout
set yrange [4700000:5070000]
plot filein with linespoints title "Memory usage (KiB)"

root@kmod ~ # gnuplot -e "filein='baseline-stress-ng.txt'; fileout='graph-stress-ng-before.txt'" kmod-simple-stress-ng.plot
root@kmod ~ # gnuplot -e "filein='after-stress-ng.txt'; fileout='graph-stress-ng-after.txt'" kmod-simple-stress-ng.plot

root@kmod ~ # cat graph-stress-ng-before.txt

+---------------------------------------------------------------+
5.05e+06 |-+ + A + + + + + + +-|
| * Memory usage (KiB) ***A*** |
| * A |
5e+06 |-+ ** ** +-|
| ** * * A |
4.95e+06 |-+ * * A * A* +-|
| * * A A * * * * A |
| * * * * * * *A * * * A * |
4.9e+06 |-+ * * * A*A * A*AA*A A *A **A **A*A *+-|
| A A*A A * A * * A A * A * ** |
| * ** ** * * * * * * * |
4.85e+06 |-+ A A A ** * * ** *-|
| * * * * ** * |
| * A * * * * |
4.8e+06 |-+ * * * A A-|
| * * * |
4.75e+06 |-+ * * * +-|
| * ** |
| * + + + + + + ** + |
4.7e+06 +---------------------------------------------------------------+
0 5 10 15 20 25 30 35 40

root@kmod ~ # cat graph-stress-ng-after.txt

+---------------------------------------------------------------+
5.05e+06 |-+ + + + + + + + +-|
| Memory usage (KiB) ***A*** |
| |
5e+06 |-+ +-|
| |
4.95e+06 |-+ +-|
| |
| |
4.9e+06 |-+ *AA +-|
| A*AA*A*A A A*AA*AA*A*AA*A A A A*A *AA*A*A A A*AA*AA |
| * * ** * * * ** * *** * |
4.85e+06 |-+* *** * * * * *** A * * +-|
| * A * * ** * * A * * |
| * * * * ** * * |
4.8e+06 |-+* * * A * * * +-|
| * * * A * * |
4.75e+06 |-* * * * * +-|
| * * * * * |
| * + * *+ + + + + * *+ |
4.7e+06 +---------------------------------------------------------------+
0 5 10 15 20 25 30 35 40

[0] https://lkml.kernel.org/r/20221013180518.217405-1-david@redhat.com

Reported-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
(cherry picked from commit 064f4536d13939b6e8cdb71298ff5d657f4f8caa)
Signed-off-by: Jonathan Maple <jmaple@ciq.com>

# Conflicts:
#	kernel/module.c
#	kernel/module/stats.c
diff --cc kernel/module.c
index 90ad015c6fb5,044aa2c9e3cb..000000000000
--- a/kernel/module.c
+++ b/kernel/module.c
@@@ -3938,7 -2787,38 +3938,42 @@@ static int unknown_module_param_cb(cha
  	return 0;
  }
  
++<<<<<<< HEAD:kernel/module.c
 +static void cfi_init(struct module *mod);
++=======
+ /* Module within temporary copy, this doesn't do any allocation */
+ static int early_mod_check(struct load_info *info, int flags)
+ {
+ 	int err;
+ 
+ 	/*
+ 	 * Now that we know we have the correct module name, check
+ 	 * if it's blacklisted.
+ 	 */
+ 	if (blacklisted(info->name)) {
+ 		pr_err("Module %s is blacklisted\n", info->name);
+ 		return -EPERM;
+ 	}
+ 
+ 	err = rewrite_section_headers(info, flags);
+ 	if (err)
+ 		return err;
+ 
+ 	/* Check module struct version now, before we try to use module. */
+ 	if (!check_modstruct_version(info, info->mod))
+ 		return -ENOEXEC;
+ 
+ 	err = check_modinfo(info->mod, info, flags);
+ 	if (err)
+ 		return err;
+ 
+ 	mutex_lock(&module_mutex);
+ 	err = module_patient_check_exists(info->mod->name, FAIL_DUP_MOD_BECOMING);
+ 	mutex_unlock(&module_mutex);
+ 
+ 	return err;
+ }
++>>>>>>> 064f4536d139 (module: avoid allocation if module is already present and ready):kernel/module/main.c
  
  /*
   * Allocate and load the module: note that size of section 0 is always
* Unmerged path kernel/module/stats.c
* Unmerged path kernel/module.c
* Unmerged path kernel/module/stats.c
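
For reference, in the upstream commit the new helper is wired into load_module()
right after the initial ELF copy is validated, roughly like this (paraphrased
from upstream 064f4536d139; the call site itself is not part of the hunk above):

	err = early_mod_check(info, flags);
	if (err)
		goto free_copy;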
