
Commit c34b3ec

tpressure authored and akpm00 committed
mm: hugetlb: improve parallel huge page allocation time
Patch series "Add a command line option that enables control of how many threads should be used to allocate huge pages", v2.

Allocating huge pages can take a very long time on servers with terabytes of memory, even when they are allocated at boot time and the allocation happens in parallel. Before this series, the kernel used a hard-coded value of 2 threads per NUMA node for these allocations. This value might have been good enough in the past, but it is not sufficient to fully utilize newer systems.

This series changes the default so the kernel uses 25% of the available hardware threads for these allocations. In addition, users who wish to micro-optimize the allocation time can override this value via a new kernel parameter.

We tested this on 2 generations of Xeon CPUs and the results show a big improvement of the overall allocation time:

+-----------------------+-------+-------+-------+-------+-------+
| threads               | 8     | 16    | 32    | 64    | 128   |
+-----------------------+-------+-------+-------+-------+-------+
| skylake 144 cpus      | 44s   | 22s   | 16s   | 19s   | 20s   |
| cascade lake 192 cpus | 39s   | 20s   | 11s   | 10s   | 9s    |
+-----------------------+-------+-------+-------+-------+-------+

On Skylake, we see an improvement of 2.75x when using 32 threads; on Cascade Lake we can do even better, at 4.3x, when we use 128 threads. This speedup is quite significant, and users of large machines like these should have the option to make them boot as fast as possible.

This patch (of 3): Before this patch, the kernel used a hard-coded value of 2 threads per NUMA node for these allocations. This patch changes the policy so that the kernel uses 25% of the available hardware threads instead.

Link: https://lkml.kernel.org/r/20250227-hugepage-parameter-v2-0-7db8c6dc0453@cyberus-technology.de
Link: https://lkml.kernel.org/r/20250227-hugepage-parameter-v2-1-7db8c6dc0453@cyberus-technology.de
Signed-off-by: Thomas Prescher <thomas.prescher@cyberus-technology.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
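For illustration, here is a minimal user-space sketch (not kernel code) contrasting the old and new thread-count policies. The CPU and node counts are plain function parameters because num_online_cpus() and num_node_state(N_MEMORY) exist only inside the kernel, and the 4-node figure for the Cascade Lake box is an assumption; the commit message does not state its topology.

#include <stdio.h>

/* Old policy: a hard-coded 2 threads per NUMA node that has memory. */
static unsigned int threads_old(unsigned int memory_nodes)
{
	return memory_nodes * 2;
}

/* New policy: 25% of the online CPUs, but never fewer than 1 thread. */
static unsigned int threads_new(unsigned int online_cpus)
{
	unsigned int n = online_cpus / 4;

	return n > 0 ? n : 1;
}

int main(void)
{
	/* A 192-CPU Cascade Lake box, assumed to have 4 memory nodes. */
	printf("old policy: %u threads\n", threads_old(4));   /* -> 8  */
	printf("new policy: %u threads\n", threads_new(192)); /* -> 48 */
	return 0;
}

At 192 CPUs the new default lands at 48 threads, between the 32- and 64-thread columns of the table above, which is roughly where the measured allocation times level off.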
1 parent 3dc30ef commit c34b3ec

1 file changed: +18 -16 lines changed

mm/hugetlb.c

Lines changed: 18 additions & 16 deletions
@@ -14,9 +14,11 @@
 #include <linux/pagemap.h>
 #include <linux/mempolicy.h>
 #include <linux/compiler.h>
+#include <linux/cpumask.h>
 #include <linux/cpuset.h>
 #include <linux/mutex.h>
 #include <linux/memblock.h>
+#include <linux/minmax.h>
 #include <linux/sysfs.h>
 #include <linux/slab.h>
 #include <linux/sched/mm.h>
@@ -3605,31 +3607,31 @@ static unsigned long __init hugetlb_pages_alloc_boot(struct hstate *h)
 		.numa_aware = true
 	};
 
+	unsigned int num_allocation_threads = max(num_online_cpus() / 4, 1);
+
 	job.thread_fn	= hugetlb_pages_alloc_boot_node;
 	job.start	= 0;
 	job.size	= h->max_huge_pages;
 
 	/*
-	 * job.max_threads is twice the num_node_state(N_MEMORY),
+	 * job.max_threads is 25% of the available cpu threads by default.
 	 *
-	 * Tests below indicate that a multiplier of 2 significantly improves
-	 * performance, and although larger values also provide improvements,
-	 * the gains are marginal.
+	 * On large servers with terabytes of memory, huge page allocation
+	 * can consume a considerable amount of time.
 	 *
-	 * Therefore, choosing 2 as the multiplier strikes a good balance between
-	 * enhancing parallel processing capabilities and maintaining efficient
-	 * resource management.
+	 * Tests below show how long it takes to allocate 1 TiB of memory with
+	 * 2MiB huge pages. Using more threads can significantly improve allocation time.
 	 *
-	 * +------------+-------+-------+-------+-------+-------+
-	 * | multiplier | 1     | 2     | 3     | 4     | 5     |
-	 * +------------+-------+-------+-------+-------+-------+
-	 * | 256G 2node | 358ms | 215ms | 157ms | 134ms | 126ms |
-	 * | 2T   4node | 979ms | 679ms | 543ms | 489ms | 481ms |
-	 * | 50G  2node |  71ms |  44ms |  37ms |  30ms |  31ms |
-	 * +------------+-------+-------+-------+-------+-------+
+	 * +-----------------------+-------+-------+-------+-------+-------+
+	 * | threads               | 8     | 16    | 32    | 64    | 128   |
+	 * +-----------------------+-------+-------+-------+-------+-------+
+	 * | skylake 144 cpus      | 44s   | 22s   | 16s   | 19s   | 20s   |
+	 * | cascade lake 192 cpus | 39s   | 20s   | 11s   | 10s   | 9s    |
+	 * +-----------------------+-------+-------+-------+-------+-------+
 	 */
-	job.max_threads	= num_node_state(N_MEMORY) * 2;
-	job.min_chunk	= h->max_huge_pages / num_node_state(N_MEMORY) / 2;
+
+	job.max_threads	= num_allocation_threads;
+	job.min_chunk	= h->max_huge_pages / num_allocation_threads;
 	padata_do_multithreaded(&job);
 
 	return h->nr_huge_pages;
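As a rough model of how the max_threads and min_chunk fields above interact, the sketch below mimics the worker-count clamping that padata_do_multithreaded() performs over the [start, start + size) range. The struct and the plan() helper are hypothetical simplifications; the real padata code also honours a job.align field and rounds chunk sizes accordingly.

#include <stdio.h>

/* Hypothetical stand-ins for the padata_mt_job fields used in the hunk above. */
struct mt_job {
	unsigned long start;       /* first unit of work            */
	unsigned long size;        /* number of units to process    */
	unsigned long min_chunk;   /* smallest chunk worth a thread */
	unsigned long max_threads; /* upper bound on worker threads */
};

/*
 * Rough model of padata's sizing: never spawn more workers than there
 * are min_chunk-sized pieces of the range, and never more than
 * max_threads.
 */
static void plan(const struct mt_job *job)
{
	unsigned long pieces = job->size / (job->min_chunk ? job->min_chunk : 1);
	unsigned long nworks = pieces ? pieces : 1;

	if (nworks > job->max_threads)
		nworks = job->max_threads;

	printf("%lu units -> %lu workers, ~%lu units each\n",
	       job->size, nworks, job->size / nworks);
}

int main(void)
{
	/* 1 TiB of 2MiB pages on a 192-CPU box: 524288 pages, 48 threads. */
	struct mt_job job = {
		.size        = 524288,
		.max_threads = 192 / 4,
		.min_chunk   = 524288 / (192 / 4),
	};

	plan(&job);
	return 0;
}

Because the patch sets min_chunk to max_huge_pages / num_allocation_threads, the range splits into exactly as many chunks as the policy chose threads, so every worker gets roughly one chunk.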
