module: avoid allocation if module is already present and ready

PlaidCat · PlaidCat · commit 35edb641fd62 · 2024-10-22T16:12:15.000-04:00
jira LE-1907
Rebuild_History Non-Buildable kernel-5.14.0-427.40.1.el9_4
commit-author Luis Chamberlain <mcgrof@kernel.org>
commit 064f453
Empty-Commit: Cherry-Pick Conflicts during history rebuild.
Will be included in final tarball splat. Ref for failed cherry-pick at:
ciq/ciq_backports/kernel-5.14.0-427.40.1.el9_4/064f4536.failed

The finit_module() system call can create unnecessary virtual memory
pressure for duplicate modules. This is because load_module() can, in
the worst case, allocate more than twice the size of a module in virtual
memory. This patch saves at least one full module size of wasted vmalloc
space by rejecting duplicates as soon as we can validate the module name
in the module structure that was read.

This can only be an issue if a system is getting hammered with userspace
loading modules. There are typically two ways modules get loaded on a
system: one is kernel module auto-loading (*request_module*() calls
in-kernel) and the other is tools like udev. Auto-loading starts
in-kernel, but it calls back to userspace to run modprobe. We already
have a way to restrict the number of concurrent kernel auto-loads at a
given time; however, that still allows multiple requests for the same
module to go through and forces two userspace threads to race to call
modprobe for the exact same module. Even though libkmod, which both
modprobe and udev use, checks whether a module is already loaded prior
to calling finit_module(), races are still possible, and this is clearly
evident today on systems with multiple CPUs.

To avoid memory pressure for such stupid cases, put a stopgap in place
for them. The *earliest* we can detect duplicates from the modules side
of things is once we have blessed the module name, sadly after the first
vmalloc allocation. We can check for the module being present *before* a
secondary vmalloc() allocation.

There is a linear relationship between wasted virtual memory bytes and
the CPU count.
The reason is that udev ends up racing to load many of the same modules
for each of the CPUs.

We can see the different linear relationships between wasted virtual
memory and CPU count after boot in the following graph:

[Graph: wasted virtual memory (y-axis, "waste", 0 to 14 GB) versus CPU count (x-axis, 0 to 300); two series, "Before" (*) and "After" (#), both roughly linear, with "Before" the steeper line]

On the y-axis we can see gigabytes of wasted virtual memory during boot
due to duplicate module requests which just end up failing. Inferring
the slope, this comes to about 463 MiB per CPU lost prior to this
patch. After this patch we only lose about 230 MiB per CPU, for a total
savings of about 233 MiB per CPU. This is all *just on bootup*!

On an 8 vCPU, 8 GiB RAM system using kdevops, the Linux kernel selftests
kmod test 0008 (kmod.sh -t 0008) shows a saving of up to ~84 MiB at the
*highest* point of memory consumption. With the new stress-ng module
test I see a 145 MiB difference in max memory consumption with 100 ops.
The stress-ng module ops tests can be pretty pathological; they are not
realistic, however they were used to finally reproduce issues which are
only reported to happen on systems with over 400 CPUs [0], using just
100 ops on an 8 vCPU, 8 GiB RAM system. Running out of virtual memory
space is no surprise given the above graph: since at least on x86_64
we're capped at 128 MiB, eventually we'd hit a series of errors, and one
can use the above graph to guesstimate when.
This of course will vary depending on the features you have enabled.
For instance, enabling KASAN seems to make this much worse.

The results with kmod and stress-ng can be observed and visualized below.
The time it takes to run the test is also not affected.

The kmod tests 0008:

The gnuplot is set to a range from 400000 KiB (390 MiB) - 580000 KiB (566 MiB)
given the tests peak around that range.

cat kmod.plot
set term dumb
set output fileout
set yrange [400000:580000]
plot filein with linespoints title "Memory usage (KiB)"

Before:
root@kmod ~ # /data/linux-next/tools/testing/selftests/kmod/kmod.sh -t 0008
root@kmod ~ # free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > log-0008-before.txt ^C
root@kmod ~ # sort -n -r log-0008-before.txt | head -1
528732

So ~516.33 MiB

After:
root@kmod ~ # /data/linux-next/tools/testing/selftests/kmod/kmod.sh -t 0008
root@kmod ~ # free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > log-0008-after.txt ^C
root@kmod ~ # sort -n -r log-0008-after.txt | head -1
442516

So ~432.14 MiB

That's about 84 MiB in savings in the worst case.
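The measurement above comes down to two steps: sample the "used" column of free(1) once per second, then take the largest sample. A minimal shell sketch of that pipeline follows. The function names extract_used_kib and peak_kib are hypothetical helpers introduced here for illustration, not part of the kmod selftests, and the canned rows are made up except for the 528732 KiB peak reported above.

```shell
#!/bin/sh
# Hypothetical helpers (not from the commit): wrap the sampling pipeline.

# Read free(1) output on stdin and print the "used" column (field 3)
# of every "Mem:" row, one value per line.
extract_used_kib() {
    awk '$1 == "Mem:" { print $3 }'
}

# Print the largest sample found in the log file given as $1.
peak_kib() {
    sort -n -r "$1" | head -n 1
}

# Demo on canned input so the sketch runs without sampling a live system.
# Only the 528732 value is taken from the commit message; the rest is filler.
log=$(mktemp)
extract_used_kib > "$log" <<'EOF'
               total        used        free      shared  buff/cache   available
Mem:         8140000      511204     7000000        1000      628796     7400000
Swap:              0           0           0
Mem:         8140000      528732     6980000        1000      631268     7380000
EOF
peak_kib "$log"    # prints 528732
rm -f "$log"
```

On a live system the same helpers would presumably be fed by `free -k -s 1 -c 40 | extract_used_kib > log-0008-before.txt`, which matches the grep and awk pipeline used in the commit message.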
The graphs:

root@kmod ~ # gnuplot -e "filein='log-0008-before.txt'; fileout='graph-0008-before.txt'" kmod.plot
root@kmod ~ # gnuplot -e "filein='log-0008-after.txt'; fileout='graph-0008-after.txt'" kmod.plot

root@kmod ~ # cat graph-0008-before.txt

[Plot: "Memory usage (KiB)", y-axis 400000 to 580000, x-axis 0 to 40]

root@kmod ~ # cat graph-0008-after.txt

[Plot: "Memory usage (KiB)", y-axis 400000 to 580000, x-axis 0 to 40]

The stress-ng module tests:

This is used to run the test to try to reproduce the vmap issues
reported by David:

  echo 0 > /proc/sys/vm/oom_dump_tasks
  ./stress-ng --module 100 --module-name xfs

Prior to this commit:
root@kmod ~ # free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > baseline-stress-ng.txt
root@kmod ~ # sort -n -r baseline-stress-ng.txt | head -1
5046456

After this commit:
root@kmod ~ # free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > after-stress-ng.txt
root@kmod ~ # sort -n -r after-stress-ng.txt | head -1
4896972

5046456 - 4896972
149484
149484/1024
145.98046875000000000000

So with stress-ng this commit saves about 145 MiB of memory using
100 ops, the load which reproduced the reported vmap issue.
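The subtraction and KiB to MiB conversion above can be reproduced with a short shell sketch. The helper names savings_kib and kib_to_mib are hypothetical and introduced here only for illustration; the input values are the two peaks quoted in the commit message.

```shell
#!/bin/sh
# Hypothetical helpers (not from the commit): compare two peak samples.

# Difference between two KiB values: $1 (before) minus $2 (after).
savings_kib() {
    echo $(( $1 - $2 ))
}

# Convert a KiB value to MiB, rounded to two decimals.
kib_to_mib() {
    awk -v kib="$1" 'BEGIN { printf "%.2f\n", kib / 1024 }'
}

delta=$(savings_kib 5046456 4896972)
echo "$delta KiB"                    # prints: 149484 KiB
echo "$(kib_to_mib "$delta") MiB"    # prints: 145.98 MiB
```

Applied to the kmod 0008 peaks (528732 and 442516 KiB), the same helpers give a difference of 86216 KiB, which is the roughly 84 MiB quoted for that test.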
cat kmod.plot
set term dumb
set output fileout
set yrange [4700000:5070000]
plot filein with linespoints title "Memory usage (KiB)"

root@kmod ~ # gnuplot -e "filein='baseline-stress-ng.txt'; fileout='graph-stress-ng-before.txt'" kmod-simple-stress-ng.plot
root@kmod ~ # gnuplot -e "filein='after-stress-ng.txt'; fileout='graph-stress-ng-after.txt'" kmod-simple-stress-ng.plot

root@kmod ~ # cat graph-stress-ng-before.txt

[Plot: "Memory usage (KiB)", y-axis 4.7e+06 to 5.05e+06, x-axis 0 to 40]

root@kmod ~ # cat graph-stress-ng-after.txt

[Plot: "Memory usage (KiB)", y-axis 4.7e+06 to 5.05e+06, x-axis 0 to 40]

[0] https://lkml.kernel.org/r/20221013180518.217405-1-david@redhat.com

Reported-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
(cherry picked from commit 064f453)
Signed-off-by: Jonathan Maple <jmaple@ciq.com>

# Conflicts:
#	kernel/module.c
#	kernel/module/stats.c
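The userspace race described in the commit message comes from a check-then-act pattern: a loader first checks whether the module is present and only then calls finit_module(), so two loaders can both pass the check before either load completes. A minimal shell sketch of such a check follows. It is an illustration only: module_loaded is a hypothetical helper, and reading a /proc/modules-style list is an assumption made here, not necessarily how libkmod performs its check. The canned list row is made up.

```shell
#!/bin/sh
# Hypothetical helper (not from the commit or from libkmod).
# Succeeds if module $1 appears in the first column of the list file $2
# (defaults to /proc/modules, whose first field is the module name).
module_loaded() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}

# Demo on a canned list so the sketch runs anywhere.
list=$(mktemp)
printf 'xfs 1 0 - Live 0x0\n' > "$list"

# Nothing makes the check and the load atomic: two loaders that both run
# the check before either load completes would both take the "missing"
# branch and both go on to request the same module.
if module_loaded xfs "$list"; then
    echo "xfs already loaded, skip"
else
    echo "xfs missing, would call modprobe"
fi
rm -f "$list"
```

This is why the commit adds a duplicate check on the kernel side as well: the userspace check narrows the window but cannot close it.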
diff --git a/ciq/ciq_backports/kernel-5.14.0-427.40.1.el9_4/064f4536.failed b/ciq/ciq_backports/kernel-5.14.0-427.40.1.el9_4/064f4536.failed
@@ -0,0 +1,330 @@
+module: avoid allocation if module is already present and ready
+
+jira LE-1907
+Rebuild_History Non-Buildable kernel-5.14.0-427.40.1.el9_4
+commit-author Luis Chamberlain <mcgrof@kernel.org>
+commit 064f4536d13939b6e8cdb71298ff5d657f4f8caa
+Empty-Commit: Cherry-Pick Conflicts during history rebuild.
+Will be included in final tarball splat. Ref for failed cherry-pick at:
+ciq/ciq_backports/kernel-5.14.0-427.40.1.el9_4/064f4536.failed
+
+The finit_module() system call can create unnecessary virtual memory
+pressure for duplicate modules. This is because load_module() can in
+the worst case allocate more than twice the size of a module in virtual
+memory. This saves at least a full size of the module in wasted vmalloc
+space memory by trying to avoid duplicates as soon as we can validate
+the module name in the read module structure.
+
+This can only be an issue if a system is getting hammered with userspace
+loading modules. There are two ways to load modules typically on systems,
+one is the kernel module auto-loading (*request_module*() calls in-kernel)
+and the other is things like udev. The auto-loading is in-kernel, but that
+pings back to userspace to just call modprobe. We already have a way to
+restrict the number of concurrent kernel auto-loads in a given time, however
+that still allows multiple requests for the same module to go through
+and force two threads in userspace racing to call modprobe for the same
+exact module. Even though libkmod, which both modprobe and udev use, does
+check if a module is already loaded prior to calling finit_module(), races
+are still possible and this is clearly evident today when you have multiple
+CPUs.
+
+To avoid memory pressure for such stupid cases put a stop gap for them.
+The *earliest* we can detect duplicates from the modules side of things
+is once we have blessed the module name, sadly after the first vmalloc
+allocation. We can check for the module being present *before* a secondary
+vmalloc() allocation.
+
+There is a linear relationship between wasted virtual memory bytes and
+the CPU count. The reason is that udev ends up racing to load
+many of the same modules for each of the CPUs.
+
+We can see the different linear relationships between wasted virtual
+memory and CPU count after boot in the following graph:
+
+         +----------------------------------------------------------------------------+
+    14GB |-+          +            +            +           +           *+          +-|
+         |                                                          ****              |
+         |                                                       ***                  |
+         |                                                     **                     |
+    12GB |-+                                                 **                     +-|
+         |                                                 **                         |
+         |                                               **                           |
+         |                                             **                             |
+         |                                           **                               |
+    10GB |-+                                       **                               +-|
+         |                                       **                                   |
+         |                                     **                                     |
+         |                                   **                                       |
+     8GB |-+                               **                                       +-|
+waste    |                               **                             ###           |
+         |                             **                           ####              |
+         |                           **                      #######                  |
+     6GB |-+                     ****                    ####                       +-|
+         |                      *                    ####                             |
+         |                     *                 ####                                 |
+         |                *****              ####                                     |
+     4GB |-+            **               ####                                       +-|
+         |            **             ####                                             |
+         |          **           ####                                                 |
+         |        **         ####                                                     |
+     2GB |-+    **      #####                                                       +-|
+         |     *    ####                                                              |
+         |    * ####                                                   Before ******* |
+         |  **##      +            +            +           +           After ####### |
+         +----------------------------------------------------------------------------+
+         0            50          100          150         200          250          300
+                                          CPUs count
+
+On the y-axis we can see gigabytes of wasted virtual memory during boot
+due to duplicate module requests which just end up failing. Trying to
+infer the slope this ends up being about 463 MiB per CPU lost prior
+to this patch. After this patch we only lose about 230 MiB per CPU, for
+a total savings of about 233 MiB per CPU. This is all *just on bootup*!
+
+On an 8 vCPU, 8 GiB RAM system using kdevops and testing against selftests
+kmod.sh -t 0008 I see a saving in the *highest* side of memory
+consumption of up to ~84 MiB with the Linux kernel selftests kmod
+test 0008. With the new stress-ng module test I see a 145 MiB difference
+in max memory consumption with 100 ops. The stress-ng module ops tests can be
+pretty pathological; it is not realistic, however it was used to
+finally successfully reproduce issues which are only reported to happen on
+systems with over 400 CPUs [0] by just using 100 ops on an 8 vCPU, 8 GiB RAM
+system. Running out of virtual memory space is no surprise given the
+above graph, since at least on x86_64 we're capped at 128 MiB, eventually
+we'd hit a series of errors and one can use the above graph to
+guesstimate when. This of course will vary depending on the features
+you have enabled. So for instance, enabling KASAN seems to make this
+much worse.
+
+The results with kmod and stress-ng can be observed and visualized below.
+The time it takes to run the test is also not affected.
+
+The kmod tests 0008:
+
+The gnuplot is set to a range from 400000 KiB (390 MiB) - 580000 KiB (566 MiB)
+given the tests peak around that range.
+
+cat kmod.plot
+set term dumb
+set output fileout
+set yrange [400000:580000]
+plot filein with linespoints title "Memory usage (KiB)"
+
+Before:
+root@kmod ~ # /data/linux-next/tools/testing/selftests/kmod/kmod.sh -t 0008
+root@kmod ~ # free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > log-0008-before.txt ^C
+root@kmod ~ # sort -n -r log-0008-before.txt | head -1
+528732
+
+So ~516.33 MiB
+
+After:
+
+root@kmod ~ # /data/linux-next/tools/testing/selftests/kmod/kmod.sh -t 0008
+root@kmod ~ # free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > log-0008-after.txt ^C
+
+root@kmod ~ # sort -n -r log-0008-after.txt | head -1
+442516
+
+So ~432.14 MiB
+
+That's about 84 MiB in savings in the worst case. The graphs:
+
+root@kmod ~ # gnuplot -e "filein='log-0008-before.txt'; fileout='graph-0008-before.txt'" kmod.plot
+root@kmod ~ # gnuplot -e "filein='log-0008-after.txt';  fileout='graph-0008-after.txt'"  kmod.plot
+
+root@kmod ~ # cat graph-0008-before.txt
+
+  580000 +-----------------------------------------------------------------+
+         |       +        +       +       +       +        +       +       |
+  560000 |-+                                    Memory usage (KiB) ***A***-|
+         |                                                                 |
+  540000 |-+                                                             +-|
+         |                                                                 |
+         |        *A     *AA*AA*A*AA          *A*AA    A*A*A *AA*A*AA*A  A |
+  520000 |-+A*A*AA  *AA*A           *A*AA*A*AA     *A*A     A          *A+-|
+         |*A                                                               |
+  500000 |-+                                                             +-|
+         |                                                                 |
+  480000 |-+                                                             +-|
+         |                                                                 |
+  460000 |-+                                                             +-|
+         |                                                                 |
+         |                                                                 |
+  440000 |-+                                                             +-|
+         |                                                                 |
+  420000 |-+                                                             +-|
+         |       +        +       +       +       +        +       +       |
+  400000 +-----------------------------------------------------------------+
+         0       5        10      15      20      25       30      35      40
+
+root@kmod ~ # cat graph-0008-after.txt
+
+  580000 +-----------------------------------------------------------------+
+         |       +        +       +       +       +        +       +       |
+  560000 |-+                                    Memory usage (KiB) ***A***-|
+         |                                                                 |
+  540000 |-+                                                             +-|
+         |                                                                 |
+         |                                                                 |
+  520000 |-+                                                             +-|
+         |                                                                 |
+  500000 |-+                                                             +-|
+         |                                                                 |
+  480000 |-+                                                             +-|
+         |                                                                 |
+  460000 |-+                                                             +-|
+         |                                                                 |
+         |          *A              *A*A                                   |
+  440000 |-+A*A*AA*A  A       A*A*AA    A*A*AA*A*AA*A*AA*A*AA*AA*A*AA*A*AA-|
+         |*A           *A*AA*A                                             |
+  420000 |-+                                                             +-|
+         |       +        +       +       +       +        +       +       |
+  400000 +-----------------------------------------------------------------+
+         0       5        10      15      20      25       30      35      40
+
+The stress-ng module tests:
+
+This is used to run the test to try to reproduce the vmap issues
+reported by David:
+
+  echo 0 > /proc/sys/vm/oom_dump_tasks
+  ./stress-ng --module 100 --module-name xfs
+
+Prior to this commit:
+root@kmod ~ # free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > baseline-stress-ng.txt
+root@kmod ~ # sort -n -r baseline-stress-ng.txt | head -1
+5046456
+
+After this commit:
+root@kmod ~ # free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > after-stress-ng.txt
+root@kmod ~ # sort -n -r after-stress-ng.txt | head -1
+4896972
+
+5046456 - 4896972
+149484
+149484/1024
+145.98046875000000000000
+
+So with stress-ng this commit saves about 145 MiB of memory using
+100 ops, the load which reproduced the reported vmap issue.
+
+cat kmod.plot
+set term dumb
+set output fileout
+set yrange [4700000:5070000]
+plot filein with linespoints title "Memory usage (KiB)"
+
+root@kmod ~ # gnuplot -e "filein='baseline-stress-ng.txt'; fileout='graph-stress-ng-before.txt'"  kmod-simple-stress-ng.plot
+root@kmod ~ # gnuplot -e "filein='after-stress-ng.txt'; fileout='graph-stress-ng-after.txt'"  kmod-simple-stress-ng.plot
+
+root@kmod ~ # cat graph-stress-ng-before.txt
+
+           +---------------------------------------------------------------+
+  5.05e+06 |-+     + A     +       +       +       +       +       +     +-|
+           |         *                          Memory usage (KiB) ***A*** |
+           |         *                             A                       |
+     5e+06 |-+      **                            **                     +-|
+           |        **                            * *    A                 |
+  4.95e+06 |-+      * *                          A  *   A*               +-|
+           |        * *      A       A           *  *  *  *             A  |
+           |       *  *     * *     * *        *A   *  *  *      A      *  |
+   4.9e+06 |-+     *  *     * A*A   * A*AA*A  A      *A    **A   **A*A  *+-|
+           |       A  A*A  A    *  A       *  *      A     A *  A    * **  |
+           |      *      **      **         * *              *  *    * * * |
+  4.85e+06 |-+   A       A       A          **               *  *     ** *-|
+           |     *                           *               * *      ** * |
+           |     *                           A               * *      *  * |
+   4.8e+06 |-+   *                                           * *      A  A-|
+           |     *                                           * *           |
+  4.75e+06 |-+  *                                            * *         +-|
+           |    *                                            **            |
+           |    *  +       +       +       +       +       + **    +       |
+   4.7e+06 +---------------------------------------------------------------+
+           0       5       10      15      20      25      30      35      40
+
+root@kmod ~ # cat graph-stress-ng-after.txt
+
+           +---------------------------------------------------------------+
+  5.05e+06 |-+     +       +       +       +       +       +       +     +-|
+           |                                    Memory usage (KiB) ***A*** |
+           |                                                               |
+     5e+06 |-+                                                           +-|
+           |                                                               |
+  4.95e+06 |-+                                                           +-|
+           |                                                               |
+           |                                                               |
+   4.9e+06 |-+                                      *AA                  +-|
+           |  A*AA*A*A  A  A*AA*AA*A*AA*A  A  A  A*A   *AA*A*A  A  A*AA*AA |
+           |  *      * **  *            *  *  ** *            ***  *       |
+  4.85e+06 |-+*       ***  *            * * * ***             A *  *     +-|
+           |  *       A *  *             ** * * A               *  *       |
+           |  *         *  *             *  **                  *  *       |
+   4.8e+06 |-+*         *  *             A   *                  *  *     +-|
+           | *          * *                  A                  * *        |
+  4.75e+06 |-*          * *                                     * *      +-|
+           | *          * *                                     * *        |
+           | *     +    * *+       +       +       +       +    * *+       |
+   4.7e+06 +---------------------------------------------------------------+
+           0       5       10      15      20      25      30      35      40
+
+[0] https://lkml.kernel.org/r/20221013180518.217405-1-david@redhat.com
+
+	Reported-by: David Hildenbrand <david@redhat.com>
+	Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
+(cherry picked from commit 064f4536d13939b6e8cdb71298ff5d657f4f8caa)
+	Signed-off-by: Jonathan Maple <jmaple@ciq.com>
+
+# Conflicts:
+#	kernel/module.c
+#	kernel/module/stats.c
+diff --cc kernel/module.c
+index 90ad015c6fb5,044aa2c9e3cb..000000000000
+--- a/kernel/module.c
++++ b/kernel/module.c
+@@@ -3938,7 -2787,38 +3938,42 @@@ static int unknown_module_param_cb(cha
+  	return 0;
+  }
+  
+++<<<<<<< HEAD:kernel/module.c
+ +static void cfi_init(struct module *mod);
+++=======
++ /* Module within temporary copy, this doesn't do any allocation  */
++ static int early_mod_check(struct load_info *info, int flags)
++ {
++ 	int err;
++ 
++ 	/*
++ 	 * Now that we know we have the correct module name, check
++ 	 * if it's blacklisted.
++ 	 */
++ 	if (blacklisted(info->name)) {
++ 		pr_err("Module %s is blacklisted\n", info->name);
++ 		return -EPERM;
++ 	}
++ 
++ 	err = rewrite_section_headers(info, flags);
++ 	if (err)
++ 		return err;
++ 
++ 	/* Check module struct version now, before we try to use module. */
++ 	if (!check_modstruct_version(info, info->mod))
++ 		return -ENOEXEC;
++ 
++ 	err = check_modinfo(info->mod, info, flags);
++ 	if (err)
++ 		return err;
++ 
++ 	mutex_lock(&module_mutex);
++ 	err = module_patient_check_exists(info->mod->name, FAIL_DUP_MOD_BECOMING);
++ 	mutex_unlock(&module_mutex);
++ 
++ 	return err;
++ }
+++>>>>>>> 064f4536d139 (module: avoid allocation if module is already present and ready):kernel/module/main.c
+  
+  /*
+   * Allocate and load the module: note that size of section 0 is always
+* Unmerged path kernel/module/stats.c
+* Unmerged path kernel/module.c
+* Unmerged path kernel/module/stats.c