@@ -62,10 +62,10 @@ We will use these command-line switches for local (dev PC), and benchmark (chall

 On my PC, it takes less than 5 seconds to process the 16GB file with 8/10 threads.

-Let's compare our `mormot` with a solid multi-threaded entry using file buffer reads and no memory map (like `sbalazs`), using the `time` command on Linux:
+Let's compare `abouchez` with a solid multi-threaded entry using file buffer reads and no memory map (like `sbalazs`), using the `time` command on Linux:

 ```
-ab@dev:~/dev/github/1brc-ObjectPascal/bin$ time ./mormot measurements.txt -t=10 >resmrel5.txt
+ab@dev:~/dev/github/1brc-ObjectPascal/bin$ time ./abouchez measurements.txt -t=10 >resmrel5.txt

 real 0m4,216s
 user 0m38,789s
@@ -77,13 +77,13 @@ real 0m25,330s
 user 6m44,853s
 sys 0m31,167s
 ```
-We used 20 threads for `sbalazs`, and 10 threads for `mormot` because it was giving the best results for each program on our PC.
+We used 20 threads for `sbalazs`, and 10 threads for `abouchez` because it was giving the best results for each program on our PC.

-Apart from the obvious global "wall" time reduction (`real` numbers), the raw parsing and data gathering in the threads match the number of threads and the running time (`user` numbers), and no syscall is involved by `mormot` thanks to the memory mapping of the whole file (`sys` numbers, which contain only memory page faults).
+Apart from the obvious global "wall" time reduction (`real` numbers), the raw parsing and data gathering in the threads match the number of threads and the running time (`user` numbers), and no syscall is involved by `abouchez` thanks to the memory mapping of the whole file (`sys` numbers, which contain only memory page faults).
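
As a rough cross-check of the `real` and `user` figures quoted above, dividing `user` by `real` estimates how many cores were busy on average during a run. The sketch below is not part of the entry: `parse_time` and `busy_cores` are hypothetical helper names, it assumes the comma-decimal `time` format shown in the listings, and it assumes the second pair of figures belongs to the `sbalazs` run.

```python
import re


def parse_time(value: str) -> float:
    """Convert a `time` field such as '0m4,216s' or '6m44,853s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)[.,](\d+)s", value)
    if match is None:
        raise ValueError(f"unrecognized time value: {value!r}")
    minutes, seconds, fraction = match.groups()
    return int(minutes) * 60 + float(f"{seconds}.{fraction}")


def busy_cores(real: str, user: str) -> float:
    """Average number of cores kept busy: CPU time divided by wall time."""
    return parse_time(user) / parse_time(real)


# Figures copied from the two listings above.
print(round(busy_cores("0m4,216s", "0m38,789s"), 1))   # 10-thread run  -> 9.2
print(round(busy_cores("0m25,330s", "6m44,853s"), 1))  # 20-thread run  -> 16.0
```

With 10 threads the ratio stays close to the thread count, which is consistent with the threads spending most of their time parsing rather than waiting.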

-The `memmap` feature makes the initial`mormot` call slower, because it needs to cache all measurements data from file into RAM (I have 32GB of RAM, so the whole data file will remain in memory, as on the benchmark hardware):
+The `memmap()` feature makes the initial/cold `abouchez` call slower, because it needs to cache all measurements data from file into RAM (I have 32GB of RAM, so the whole data file will remain in memory, as on the benchmark hardware):
 ```
-ab@dev:~/dev/github/1brc-ObjectPascal/bin$ time ./mormot measurements.txt -t=10 >resmrel4.txt
+ab@dev:~/dev/github/1brc-ObjectPascal/bin$ time ./abouchez measurements.txt -t=10 >resmrel4.txt

 real 0m6,042s
 user 0m53,699s
@@ -93,11 +93,11 @@ This is the expected behavior, and will be fine with the benchmark challenge, wh

 On my Intel 13th gen processor with E-cores and P-cores, forcing thread-to-core affinity does not help:
 Processing measurements.txt with 10 threads and affinity=false
 result hash=8A6B746A, result length=1139418, stations count=41343, valid utf8=1
 done in 4.25s 3.6 GB/s
-ab@dev:~/dev/github/1brc-ObjectPascal/bin$ ./mormot measurements.txt -t=10 -v -a
+ab@dev:~/dev/github/1brc-ObjectPascal/bin$ ./abouchez measurements.txt -t=10 -v -a
 Processing measurements.txt with 10 threads and affinity=true
 result hash=8A6B746A, result length=1139418, stations count=41343, valid utf8=1
 done in 4.42s 3.5 GB/s
@@ -115,13 +115,13 @@ So we first need to find out which options leverage at best the hardware it runs
 On the https://github.com/gcarreno/1brc-ObjectPascal challenge hardware, which is a Ryzen 9 5950x with 16 cores / 32 threads and 64MB of L3 cache, each thread using around 2.5MB of its own data, we should try several options with 16-24-32 threads, for instance:

 ```
-./mormot measurements.txt -v -t=8
-./mormot measurements.txt -v -t=16
-./mormot measurements.txt -v -t=24
-./mormot measurements.txt -v -t=32
-./mormot measurements.txt -v -t=16 -a
-./mormot measurements.txt -v -t=24 -a
-./mormot measurements.txt -v -t=32 -a
+./abouchez measurements.txt -v -t=8
+./abouchez measurements.txt -v -t=16
+./abouchez measurements.txt -v -t=24
+./abouchez measurements.txt -v -t=32
+./abouchez measurements.txt -v -t=16 -a
+./abouchez measurements.txt -v -t=24 -a
+./abouchez measurements.txt -v -t=32 -a
 ```
 Please run those command lines, to guess which parameters are to be run for the benchmark, and would give the best results on the actual benchmark PC with its Ryzen 9 CPU. We will see if core affinity makes a difference here.
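
The seven command lines listed above could also be driven from a small script, which makes it easier to repeat each run and compare wall times. This is only a sketch and is not shipped with the entry: `build_commands` and `run_sweep` are hypothetical names, and taking the best of three runs is an assumption meant to discount the slower cold first call.

```python
import shlex
import subprocess
import time


def build_commands(binary="./abouchez", data="measurements.txt"):
    """Return the argv lists for the thread/affinity combinations listed above."""
    commands = [[binary, data, "-v", f"-t={n}"] for n in (8, 16, 24, 32)]
    commands += [[binary, data, "-v", f"-t={n}", "-a"] for n in (16, 24, 32)]
    return commands


def run_sweep(commands, repeats=3):
    """Run each command `repeats` times; return (command, best wall time), fastest first."""
    results = []
    for argv in commands:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            # Program output is discarded; only the elapsed wall time is kept.
            subprocess.run(argv, stdout=subprocess.DEVNULL, check=True)
            best = min(best, time.perf_counter() - start)
        results.append((shlex.join(argv), best))
    return sorted(results, key=lambda item: item[1])


if __name__ == "__main__":
    # Print the planned commands only; nothing is executed here.
    for command in build_commands():
        print(shlex.join(command))
    # To measure, run from the directory holding the binary and the data file:
    # for command, seconds in run_sweep(build_commands()):
    #     print(f"{seconds:7.2f}s  {command}")
```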
@@ -133,6 +133,6 @@ Stay tuned!

 ## Ending Note

-There is a "pure mORMot" name lookup version available if you undefine the `CUSTOMHASH` conditional, which is around 40% slower, because it needs to copy the name into the stack before using `TDynArrayHashed`, and has a little bit overhead.
+There is a "pure mORMot" name lookup version available if you undefine the `CUSTOMHASH` conditional, which is around 40% slower, because it needs to copy the name into the stack before using `TDynArrayHashed`, and has a little more overhead.