Commit 9b5aeb7

Author: Arnaud Bouchez
fixed mORMot / abouchez proposal as requested for proper integration
1 parent 8fe8623 commit 9b5aeb7

File tree

3 files changed

+33
-29
lines changed

entries/abcz/README.md renamed to entries/abouchez/README.md

Lines changed: 20 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
# mORMot version of The One Billion Row Challenge
1+
# mORMot version of The One Billion Row Challenge by Arnaud Bouchez
22

33
## mORMot 2 is Required
44

@@ -37,13 +37,13 @@ The "64 bytes cache line" trick is quite unique among all implementations of the
3737

3838
## Usage
3939

40-
If you execute the `mormot` executable without any parameter, it will give you some hints about its usage (using mORMot `TCommandLine` abilities):
40+
If you execute the `abouchez` executable without any parameter, it will give you some hints about its usage (using mORMot `TCommandLine` abilities):
4141

4242
```
43-
ab@dev:~/dev/github/1brc-ObjectPascal/bin$ ./mormot
43+
ab@dev:~/dev/github/1brc-ObjectPascal/bin$ ./abouchez
4444
The mORMot One Billion Row Challenge
4545
46-
Usage: mormot <filename> [options] [params]
46+
Usage: abouchez <filename> [options] [params]
4747
4848
<filename> the data source filename
4949
@@ -62,10 +62,10 @@ We will use these command-line switches for local (dev PC), and benchmark (chall
6262

6363
On my PC, it takes less than 5 seconds to process the 16GB file with 8/10 threads.
6464

65-
Let's compare our `mormot` with a solid multi-threaded entry using file buffer reads and no memory map (like `sbalazs`), using the `time` command on Linux:
65+
Let's compare `abouchez` with a solid multi-threaded entry using file buffer reads and no memory map (like `sbalazs`), using the `time` command on Linux:
6666

6767
```
68-
ab@dev:~/dev/github/1brc-ObjectPascal/bin$ time ./mormot measurements.txt -t=10 >resmrel5.txt
68+
ab@dev:~/dev/github/1brc-ObjectPascal/bin$ time ./abouchez measurements.txt -t=10 >resmrel5.txt
6969
7070
real 0m4,216s
7171
user 0m38,789s
@@ -77,13 +77,13 @@ real 0m25,330s
7777
user 6m44,853s
7878
sys 0m31,167s
7979
```
80-
We used 20 threads for `sbalazs`, and 10 threads for `mormot` because it was giving the best results for each program on our PC.
80+
We used 20 threads for `sbalazs`, and 10 threads for `abouchez` because it was giving the best results for each program on our PC.
8181

82-
Apart from the obvious global "wall" time reduction (`real` numbers), the raw parsing and data gathering in the threads match the number of threads and the running time (`user` numbers), and no syscall is involved by `mormot` thanks to the memory mapping of the whole file (`sys` numbers, which contain only memory page faults).
82+
Apart from the obvious global "wall" time reduction (`real` numbers), the raw parsing and data gathering in the threads match the number of threads and the running time (`user` numbers), and no syscall is involved by `abouchez` thanks to the memory mapping of the whole file (`sys` numbers, which contain only memory page faults).
8383

84-
The `memmap` feature makes the initial `mormot` call slower, because it needs to cache all measurements data from file into RAM (I have 32GB of RAM, so the whole data file will remain in memory, as on the benchmark hardware):
84+
The `memmap()` feature makes the initial/cold `abouchez` call slower, because it needs to cache all measurements data from file into RAM (I have 32GB of RAM, so the whole data file will remain in memory, as on the benchmark hardware):
8585
```
86-
ab@dev:~/dev/github/1brc-ObjectPascal/bin$ time ./mormot measurements.txt -t=10 >resmrel4.txt
86+
ab@dev:~/dev/github/1brc-ObjectPascal/bin$ time ./abouchez measurements.txt -t=10 >resmrel4.txt
8787
8888
real 0m6,042s
8989
user 0m53,699s
@@ -93,11 +93,11 @@ This is the expected behavior, and will be fine with the benchmark challenge, wh
9393

9494
On my Intel 13th gen processor with E-cores and P-cores, forcing thread to core affinity does not help:
9595
```
96-
ab@dev:~/dev/github/1brc-ObjectPascal/bin$ ./mormot measurements.txt -t=10 -v
96+
ab@dev:~/dev/github/1brc-ObjectPascal/bin$ ./abouchez measurements.txt -t=10 -v
9797
Processing measurements.txt with 10 threads and affinity=false
9898
result hash=8A6B746A, result length=1139418, stations count=41343, valid utf8=1
9999
done in 4.25s 3.6 GB/s
100-
ab@dev:~/dev/github/1brc-ObjectPascal/bin$ ./mormot measurements.txt -t=10 -v -a
100+
ab@dev:~/dev/github/1brc-ObjectPascal/bin$ ./abouchez measurements.txt -t=10 -v -a
101101
Processing measurements.txt with 10 threads and affinity=true
102102
result hash=8A6B746A, result length=1139418, stations count=41343, valid utf8=1
103103
done in 4.42s 3.5 GB/s
@@ -115,13 +115,13 @@ So we first need to find out which options leverage at best the hardware it runs
115115
On the https://github.com/gcarreno/1brc-ObjectPascal challenge hardware, which is a Ryzen 9 5950x with 16 cores / 32 threads and 64MB of L3 cache, each thread using around 2.5MB of its own data, we should try several options with 16-24-32 threads, for instance:
116116

117117
```
118-
./mormot measurements.txt -v -t=8
119-
./mormot measurements.txt -v -t=16
120-
./mormot measurements.txt -v -t=24
121-
./mormot measurements.txt -v -t=32
122-
./mormot measurements.txt -v -t=16 -a
123-
./mormot measurements.txt -v -t=24 -a
124-
./mormot measurements.txt -v -t=32 -a
118+
./abouchez measurements.txt -v -t=8
119+
./abouchez measurements.txt -v -t=16
120+
./abouchez measurements.txt -v -t=24
121+
./abouchez measurements.txt -v -t=32
122+
./abouchez measurements.txt -v -t=16 -a
123+
./abouchez measurements.txt -v -t=24 -a
124+
./abouchez measurements.txt -v -t=32 -a
125125
```
126126
Please run those command lines, to guess which parameters are to be run for the benchmark, and would give the best results on the actual benchmark PC with its Ryzen 9 CPU. We will see if core affinity makes a difference here.
127127

@@ -133,6 +133,6 @@ Stay tuned!
133133

134134
## Ending Note
135135

136-
There is a "pure mORMot" name lookup version available if you undefine the `CUSTOMHASH` conditional, which is around 40% slower, because it needs to copy the name into the stack before using `TDynArrayHashed`, and has a little bit overhead.
136+
There is a "pure mORMot" name lookup version available if you undefine the `CUSTOMHASH` conditional, which is around 40% slower, because it needs to copy the name into the stack before using `TDynArrayHashed`, and has a little more overhead.
137137

138138
Arnaud :D

entries/abcz/src/brcmormot.lpi renamed to entries/abouchez/src/brcmormot.lpi

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -19,7 +19,7 @@
1919
<CompilerOptions>
2020
<Version Value="11"/>
2121
<Target>
22-
<Filename Value="../../../bin/mormot"/>
22+
<Filename Value="../../../bin/abouchez"/>
2323
</Target>
2424
<SearchPaths>
2525
<IncludeFiles Value="$(ProjOutDir)"/>
@@ -53,7 +53,7 @@
5353
<CompilerOptions>
5454
<Version Value="11"/>
5555
<Target>
56-
<Filename Value="../../../bin/mormot"/>
56+
<Filename Value="../../../bin/abouchez"/>
5757
</Target>
5858
<SearchPaths>
5959
<IncludeFiles Value="$(ProjOutDir)"/>
@@ -97,7 +97,7 @@
9797
<CompilerOptions>
9898
<Version Value="11"/>
9999
<Target>
100-
<Filename Value="../../../bin/mormot"/>
100+
<Filename Value="../../../bin/abouchez"/>
101101
</Target>
102102
<SearchPaths>
103103
<IncludeFiles Value="$(ProjOutDir)"/>

entries/abcz/src/brcmormot.lpr renamed to entries/abouchez/src/brcmormot.lpr

Lines changed: 10 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -245,7 +245,7 @@ constructor TBrcThread.Create(owner: TBrcMain);
245245
procedure TBrcThread.Execute;
246246
var
247247
p, start, stop: PByteArray;
248-
v: integer;
248+
v, m: integer;
249249
l, neg: PtrInt;
250250
s: PBrcStation;
251251
{$ifndef CUSTOMHASH}
@@ -300,12 +300,16 @@ procedure TBrcThread.Execute;
300300
{$else}
301301
s := fList.Search(@name);
302302
{$endif CUSTOMHASH}
303-
inc(s^.Count);
304-
if v < s^.Min then
305-
s^.Min := v;
306-
if v > s^.Max then
307-
s^.Max := v;
308303
inc(s^.Sum, v);
304+
inc(s^.Count);
305+
m := s^.Min;
306+
if v < m then
307+
m := v; // branchless cmovl
308+
s^.Min := m;
309+
m := s^.Max;
310+
if v > m then
311+
m := v;
312+
s^.Max := m;
309313
until p >= stop;
310314
end;
311315
// aggregate this thread values into the main list
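
The Min/Max rewrite in the hunk above can be tried in isolation with the minimal sketch below. `TStation`, `UpdateStation` and the sample values are hypothetical stand-ins, not the entry's actual `TBrcStation` layout. The idea is to load the field into a local, conditionally overwrite the local, and store it back unconditionally, which gives the compiler the opportunity to use a conditional move instead of a jump. Whether Free Pascal really emits `cmovl`/`cmovg` depends on the compiler version, target CPU and optimization level, so this is an assumption to verify in the generated assembly (for example by keeping it with `fpc -a`).

```pascal
program MinMaxSketch;

{$mode objfpc}

type
  // hypothetical per-station accumulator, only for illustration
  TStation = record
    Sum: Int64;
    Count, Min, Max: Integer;
  end;
  PStation = ^TStation;

// same statement order as the new code in the hunk: Sum, Count, Min, Max
procedure UpdateStation(s: PStation; v: Integer);
var
  m: Integer;
begin
  inc(s^.Sum, v);
  inc(s^.Count);
  m := s^.Min;
  if v < m then
    m := v;      // candidate for a conditional move
  s^.Min := m;   // unconditional store
  m := s^.Max;
  if v > m then
    m := v;
  s^.Max := m;
end;

var
  st: TStation;
begin
  st.Sum := 0;
  st.Count := 0;
  st.Min := High(Integer);
  st.Max := Low(Integer);
  UpdateStation(@st, 123);
  UpdateStation(@st, -45);
  UpdateStation(@st, 67);
  WriteLn(st.Min, ' ', st.Max, ' ', st.Sum, ' ', st.Count); // prints -45 123 145 3
end.
```

The result is identical to the previous `if v < s^.Min then s^.Min := v` form; only the shape of the generated code may differ, so any gain should be measured on the target hardware.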
