
Commit 3704be6

Merge: percpu-internal/pcpu_chunk: re-layout pcpu_chunk structure to reduce false sharing
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4153
JIRA: https://issues.redhat.com/browse/RHEL-15605

This patch is a backport of the following upstream commit:

commit 3a6358c
Author: Yu Ma <yu.ma@intel.com>
Date:   Fri Jun 9 23:07:30 2023 -0400

    percpu-internal/pcpu_chunk: re-layout pcpu_chunk structure to reduce false sharing

    When running the UnixBench/Execl throughput case, false sharing is
    observed due to frequent reads of base_addr and writes to free_bytes
    and chunk_md.

    UnixBench/Execl represents a class of workload where bash scripts are
    spawned frequently to run short jobs. It issues the execl system call
    frequently, and execl calls mm_init to initialize the process's
    mm_struct. mm_init calls __percpu_counter_init to initialize percpu
    counters, and pcpu_alloc is then called, reading the base_addr of a
    pcpu_chunk for memory allocation. Inside pcpu_alloc, pcpu_alloc_area
    allocates memory from a specified chunk and updates "free_bytes" and
    "chunk_md" to record the remaining free bytes and other metadata for
    that chunk. Correspondingly, pcpu_free_area updates the same two
    members when freeing memory. The call trace from perf is as below:

    +   57.15%     0.01%  execl   [kernel.kallsyms]   [k] __percpu_counter_init
    +   57.13%     0.91%  execl   [kernel.kallsyms]   [k] pcpu_alloc
    -   55.27%    54.51%  execl   [kernel.kallsyms]   [k] osq_lock
       - 53.54% 0x654278696e552f34
            main
            __execve
            entry_SYSCALL_64_after_hwframe
            do_syscall_64
            __x64_sys_execve
            do_execveat_common.isra.47
            alloc_bprm
            mm_init
            __percpu_counter_init
            pcpu_alloc
          - __mutex_lock.isra.17

    In the current pcpu_chunk layout, `base_addr' is in the same cache
    line as `free_bytes' and `chunk_md', occupying that line's last 8
    bytes. This patch moves `bound_map' up next to `base_addr' so that
    `base_addr' lands in a new cache line. With this change, on an Intel
    Sapphire Rapids 112C/224T platform, based on v6.4-rc4, the 160-copy
    parallel score improves by 24%.

    The pcpu_chunk struct is a backing data structure per chunk, so the
    additional memory should not be dramatic. A chunk covers ballpark
    64KB to 512KB of memory, depending on config and boot-time
    parameters, so I believe the additional memory used here is nominal
    at best.

    Working the #s on my desktop:
      Percpu: 58624 kB
      28 cores -> ~2.1MB of percpu memory per core.
      At say ~128KB per chunk -> 33 chunks, generously 40 chunks.
      Adding alignment might bump the chunk size ~64 bytes, so in total
      ~2KB of overhead? I believe we can do a little better to avoid
      eating that full padding, so likely less than that.

    [dennis@kernel.org: changelog details]
    Link: https://lkml.kernel.org/r/20230610030730.110074-1-yu.ma@intel.com
    Signed-off-by: Yu Ma <yu.ma@intel.com>
    Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
    Acked-by: Dennis Zhou <dennis@kernel.org>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Dave Hansen <dave.hansen@intel.com>
    Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
Approved-by: Nico Pache <npache@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>
Merged-by: Lucas Zampieri <lzampier@redhat.com>
2 parents e89d40f + f0b0bbc
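The layout fix in this commit boils down to a general pattern: keep a read-mostly field out of the cache line that hot writers keep dirtying. Below is a minimal userspace C sketch of that idea; the struct and field names mirror pcpu_chunk but are illustrative, not the kernel definitions, it assumes 64-byte cache lines, and it uses C11 alignas in place of the kernel's ____cacheline_aligned_in_smp. Printing the offsets makes the separation visible.

    #include <stdalign.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Assumed cache-line size; 64 bytes on most x86-64 parts. */
    #define CACHELINE 64

    /* Before: base_addr shares a cache line with the hot-written fields. */
    struct chunk_before {
            int   free_bytes;   /* written on every alloc/free       */
            long  chunk_md;     /* stand-in for struct pcpu_block_md */
            void *base_addr;    /* read-mostly, but same cache line  */
    };

    /* After: alignment pushes base_addr onto its own cache line. */
    struct chunk_after {
            int   free_bytes;
            long  chunk_md;
            alignas(CACHELINE) void *base_addr;
    };

    int main(void)
    {
            printf("before: free_bytes@%zu base_addr@%zu (same %d-byte line)\n",
                   offsetof(struct chunk_before, free_bytes),
                   offsetof(struct chunk_before, base_addr), CACHELINE);
            printf("after:  free_bytes@%zu base_addr@%zu (different lines)\n",
                   offsetof(struct chunk_after, free_bytes),
                   offsetof(struct chunk_after, base_addr));
            return 0;
    }

In the "before" layout all three fields fall inside the first 64 bytes, so a writer updating free_bytes invalidates the line that readers of base_addr depend on; in the "after" layout base_addr starts at offset 64. The actual patch additionally moves bound_map into the gap ahead of base_addr, so less of the alignment padding is wasted.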

1 file changed (+9, −2 lines)

mm/percpu-internal.h

Lines changed: 9 additions & 2 deletions
@@ -41,10 +41,17 @@ struct pcpu_chunk {
 	struct list_head	list;		/* linked to pcpu_slot lists */
 	int			free_bytes;	/* free bytes in the chunk */
 	struct pcpu_block_md	chunk_md;
-	void			*base_addr;	/* base address of this chunk */
+	unsigned long		*bound_map;	/* boundary map */
+
+	/*
+	 * base_addr is the base address of this chunk.
+	 * To reduce false sharing, current layout is optimized to make sure
+	 * base_addr locate in the different cacheline with free_bytes and
+	 * chunk_md.
+	 */
+	void			*base_addr ____cacheline_aligned_in_smp;
 
 	unsigned long		*alloc_map;	/* allocation map */
-	unsigned long		*bound_map;	/* boundary map */
 	struct pcpu_block_md	*md_blocks;	/* metadata blocks */
 
 	void			*data;		/* chunk data */
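One way to sanity-check the resulting layout (an optional step, not part of the commit) is pahole from the dwarves tools, run against a vmlinux built with debug info:

    $ pahole -C pcpu_chunk vmlinux

pahole prints each member with its offset and size and annotates any holes. After this patch, base_addr should sit at a cache-line-aligned offset, with the remaining alignment padding (whatever bound_map does not now fill) reported as a hole just before it.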
