
Commit dfc8d73

Merge pull request #41 from synopse/main

abouchez / mORMot: asm code refactoring + README precisions

2 parents 30d4be0 + 3016ee8

File tree

2 files changed: +76 -14 lines changed

entries/abouchez/README.md

Lines changed: 58 additions & 4 deletions
@@ -35,7 +35,15 @@ Here are the main ideas behind this implementation proposal:
 - Can optionally output timing statistics and hash value on the console to debug and refine settings (with the `-v` command line switch);
 - Can optionally set each thread affinity to a single core (with the `-a` command line switch).
 
-The "64 bytes cache line" trick is quite unique among all implementations of the "1brc" I have seen in any language - and it does make a noticeable difference in performance. The L1 cache is well known to be the main bottleneck for any efficient in-memory process. We are very lucky the station names are just big enough to fill no more than 64 bytes, with min/max values reduced as 16-bit smallint - resulting in temperature range of -3276.7..+3276.8 which seems fair on our planet according to the IPCC. ;)
+## Why L1 Cache Matters
+
+The "64 bytes cache line" trick is quite unique among all the implementations of the "1brc" I have seen in any language - and it does make a noticeable difference in performance.
+
+The L1 cache is well known in the performance hacking literature to be the main bottleneck of any efficient in-memory process. If you want things to go fast, you should flatter your CPU L1 cache.
+
+We are very lucky that the station names are just big enough to fill no more than 64 bytes, with min/max values reduced to 16-bit smallint - resulting in a temperature range of -3276.7..+3276.8, which seems fair on our planet according to the IPCC. ;)
+
+In our first attempt, the `Station[]` array was in fact not aligned to 64 bytes itself: the RTL `SetLength()` does not align its data to the item size. So the pointer was only aligned to 32 bytes, and any record access could require filling two L1 cache lines. We added some manual alignment of the data structure, and got 5% better performance.
 
 ## Usage

@@ -133,14 +141,60 @@ time ./abouchez measurements.txt -v -t=1
 ```
 This `-t=1` run is for fun: it will run the process in a single thread. It will help to guess how optimized (and lock-free) our parsing code is, and to validate the CPU multi-core abilities. In a perfect world, other `-t=##` runs should show a perfect division of the `real` time by the number of working threads, and the `user` value reported by `time` should remain almost the same when we add threads, up to the number of CPU cores.
 
-## Feedback Needed
+## Back To Reality
+
+Our proposal has been run on the benchmark hardware, using the full automation.
+
+With 30 threads (on a busy system):
+```
+-- SSD --
+Benchmark 1: abouchez
+  Time (mean ± σ):   3.634 s ± 0.099 s  [User: 86.580 s, System: 2.012 s]
+  Range (min … max): 3.530 s … 3.834 s  10 runs
+
+-- HDD --
+Benchmark 1: abouchez
+  Time (mean ± σ):   3.629 s ± 0.102 s  [User: 86.086 s, System: 2.008 s]
+  Range (min … max): 3.497 s … 3.789 s  10 runs
+```
+
+From here on, only the SSD values are shown, because the HDD version triggered the systemd watchdog, which killed the shell and its benchmark executable. But we can see that, once the data is loaded from disk into the RAM cache, there is no difference between a `memmap` file on SSD and on HDD. Linux is a great Operating System for sure.
+
+With 24 threads:
+```
+-- SSD --
+Benchmark 1: abouchez
+  Time (mean ± σ):   2.977 s ± 0.053 s  [User: 53.790 s, System: 1.881 s]
+  Range (min … max): 2.905 s … 3.060 s  10 runs
+```
+
+With 16 threads:
+```
+-- SSD --
+Benchmark 1: abouchez
+  Time (mean ± σ):   2.472 s ± 0.061 s  [User: 27.787 s, System: 1.720 s]
+  Range (min … max): 2.386 s … 2.588 s  10 runs
+```
 
-Here we will put some additional information, once our proposal has been run on the benchmark hardware.
+With 16 threads and thread affinity (the `-a` switch on the command line):
+```
+-- SSD --
+Benchmark 1: abouchez
+  Time (mean ± σ):   3.227 s ± 0.017 s  [User: 39.731 s, System: 1.875 s]
+  Range (min … max): 3.206 s … 3.253 s  10 runs
+```
+
+So it seems we should simply run the benchmark with the `-t=16` option.
 
-Stay tuned!
+This may well be expected:
+
+- The Ryzen CPU has 16 cores with 32 threads, and it makes sense that using only the "real" cores for CPU+RAM intensive work is enough to saturate them;
+- It is a known fact from experience that forcing thread affinity is not a good idea: it is always much better to let a modern Linux kernel schedule the threads onto the CPU cores, because it has a much better knowledge of the actual system load and status - even on a "fair" CPU architecture like AMD Zen.
 
 ## Ending Note
 
+You could disable our tuned asm in the project source code, and lose about 10% by using the general purpose *mORMot* `crc32c()` and `CompareMem()` functions, which already run SSE2/SSE4.2 tuned assembly.
+
 There is a "*pure mORMot*" name lookup version available if you undefine the `CUSTOMHASH` conditional, which is around 40% slower, because it needs to copy the name into the stack before using `TDynArrayHashed`, and has a little more overhead.
 
 Arnaud :D

entries/abouchez/src/brcmormot.lpr

Lines changed: 18 additions & 10 deletions
@@ -5,7 +5,7 @@
 // a dedicated hash table is 40% faster than mORMot generic TDynArrayHashed
 
 {$define CUSTOMASM}
-// a few % faster with some dedicated asm instead of mORMot code on x86_64
+// about 10% faster with some dedicated asm instead of mORMot code on x86_64
 
 {$I mormot.defines.inc}

@@ -101,14 +101,14 @@ procedure TBrcList.Init(max: integer; align: boolean);
   SetLength(StationMem, max); // RTL won't align by 64 bytes
   Station := pointer(StationMem);
   if align then
-    while PtrUInt(Station) and 63 <> 0 do // manual alignment
+    while {%H-}PtrUInt(Station) and 63 <> 0 do // manual alignment
       inc(PByte(Station));
   SetLength(StationHash, HASHSIZE);
 end;

 {$ifdef CUSTOMASM}
 
-function crc32c(buf: PAnsiChar; len: cardinal): PtrUInt; nostackframe; assembler;
+function dohash(buf: PAnsiChar; len: cardinal): PtrUInt; nostackframe; assembler;
 asm
         xor eax, eax // it is enough to hash up to 15 bytes for our purpose
         mov ecx, len
@@ -130,7 +130,7 @@ function crc32c(buf: PAnsiChar; len: cardinal): PtrUInt; nostackframe; assembler
 @z:
 end;
 
-function MemEqual(a, b: pointer; len: PtrInt): integer; nostackframe; assembler;
+function CompareMem(a, b: pointer; len: PtrInt): boolean; nostackframe; assembler;
 asm
         add a, len
         add b, len
@@ -164,9 +164,18 @@ function MemEqual(a, b: pointer; len: PtrInt): integer; nostackframe; assembler;
         mov al, byte ptr [a + len]
         cmp al, byte ptr [b + len]
         je @eq
-@diff:  mov eax, 1
+@diff:  xor eax, eax
         ret
-@eq:    xor eax, eax // 0 = found (most common case of no hash collision)
+@eq:    mov eax, 1 // = found (most common case of no hash collision)
+end;
+
+{$else}
+
+function dohash(buf: PAnsiChar; len: cardinal): PtrUInt; inline;
+begin
+  if len > 16 then
+    len := 16; // it is enough to hash up to 16 bytes for our purpose
+  result := DefaultHasher(0, buf, len); // fast mORMot asm hasher (crc32c)
 end;
 
 {$endif CUSTOMASM}
@@ -176,16 +185,15 @@ function TBrcList.Search(name: pointer; namelen: PtrInt): PBrcStation;
   h, x: PtrUInt;
 begin
   assert(namelen <= SizeOf(TBrcStation.NameText));
-  h := crc32c({$ifndef CUSTOMASM} 0, {$endif} name, namelen);
+  h := dohash(name, namelen);
   repeat
     h := h and (HASHSIZE - 1);
     x := StationHash[h];
     if x = 0 then
       break; // void slot
     result := @Station[x - 1];
     if (result^.NameLen = namelen) and
-       ({$ifdef CUSTOMASM}MemEqual{$else}MemCmp{$endif}(
-          @result^.NameText, name, namelen) = 0) then
+       CompareMem(@result^.NameText, name, namelen) then
       exit; // found
     inc(h); // hash collision: try next slot
   until false;
@@ -460,7 +468,7 @@ function ByStationName(const A, B): integer;
   result := sa.NameLen - sb.NameLen;
 end;
 
-function Average(sum, count: PtrInt): integer;
+function Average(sum, count: PtrInt): PtrInt;
 // sum and result are temperature * 10 (one fixed decimal)
 var
   x, t: PtrInt; // temperature * 100 (two fixed decimals)
