
Commit 02e6201

Merge branch 'gcarreno:main' into main

2 parents: fc9780e + 7e40fe6

26 files changed: +1500 / -363 lines

README.md

Lines changed: 1 addition & 1 deletion
@@ -173,7 +173,7 @@ These are the results from running all entries into the challenge on my personal
 | # | Result (m:s.ms) | Compiler | Submitter | Notes | Certificates |
 |--:|----------------:|---------:|:----------|:------|:-------------|
-| 1 | 0:2.472 | lazarus-3.0, fpc-3.2.2 | Arnaud Bouchez | Using 16 threads | |
+| 1 | 0:1.861 | lazarus-3.0, fpc-3.2.2 | Arnaud Bouchez | Using 32 threads | |
 | 2 | 0:16.874 | lazarus-3.0, fpc-3.2.2 | Székely Balázs | Using 16 threads | |
 | 3 | 0:20.046 | lazarus-3.0, fpc-3.2.2 | Lurendrejer Aksen | Using 30 threads | |
 | 4 | 1:16.059 | lazarus-3.0, fpc-3.2.2 | Richard Lawson | Using 1 thread | |

entries/abouchez/README.md

Lines changed: 25 additions & 19 deletions
@@ -25,14 +25,13 @@ Here are the main ideas behind this implementation proposal:
 - Process the file in parallel using several threads (configurable, with `-t=16` by default);
 - Feed each thread from 64MB chunks of input (because thread scheduling is unfair, it is inefficient to pre-divide the size of the whole input file into the number of threads);
 - Each thread manages its own data, so there is no lock until the thread is finished and its data is consolidated;
-- Each station information (name and values) is packed into a record of exactly 64 bytes, with no external pointer/string, to match the CPU L1 cache size for efficiency;
-- Use a dedicated hash table for the name lookup, with direct crc32c SSE4.2 hash - when `TDynArrayHashed` is involved, it requires a transient name copy on the stack, which is noticeably slower (see last paragraph of this document);
+- Each station's information (name and values) is packed into a record of exactly 16 bytes, with no external pointer/string, to match the CPU L1 cache line size (64 bytes) for efficiency;
+- Use a dedicated hash table for the name lookup, with a crc32c perfect hash function - no name comparison nor storage is needed;
 - Store values as 16-bit or 32-bit integers (i.e. temperature multiplied by 10);
 - Parse temperatures with dedicated code (expects input values with a single decimal);
 - No memory allocation (e.g. no transient `string` or `TBytes`) nor any syscall is done during the parsing process, to reduce contention and ensure the process is only CPU-bound and RAM-bound (we checked this with `strace` on Linux);
 - Pascal code was tuned to generate the best possible asm output on FPC x86_64 (which is our target);
-- Some dedicated x86_64 asm has been written to replace *mORMot* `crc32c` and `MemCmp` general-purpose functions and gain a last few percents (nice to have);
-- Can optionally output timing statistics and hash value on the console to debug and refine settings (with the `-v` command line switch);
+- Can optionally output timing statistics and the resultset hash value on the console, to debug and refine settings (with the `-v` command line switch);
 - Can optionally set each thread affinity to a single core (with the `-a` command line switch).
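As an editorial illustration of the ideas above (the actual entry is Object Pascal built on *mORMot*, and the exact field layout is an assumption, not the entry's source), the 16-byte station record and the tenths-of-a-degree temperature parsing could be sketched in C like this:

```c
#include <stdint.h>

// Hypothetical 16-byte station record: no pointer or string inside,
// so four records fit exactly in one 64-byte L1 cache line.
typedef struct {
    uint32_t hash;   // crc32c of the station name (perfect hash on this dataset)
    uint32_t count;  // number of samples seen
    int16_t  min;    // temperature * 10, fits the smallint range
    int16_t  max;
    int32_t  sum;    // accumulated temperature * 10
} Station;           // exactly 16 bytes

// Dedicated parser for the "-12.3" input format: optional sign, integer
// digits, exactly one decimal digit; returns the value multiplied by 10.
static int parse_temp(const char *p, const char **end) {
    int neg = (*p == '-');
    p += neg;
    int v = 0;
    while (*p != '.')
        v = v * 10 + (*p++ - '0');
    v = v * 10 + (p[1] - '0');  // fold in the single decimal digit
    *end = p + 2;
    return neg ? -v : v;
}
```

Because four records share one cache line, a hash-table hit rarely touches more than a single L1 line.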

 ## Why L1 Cache Matters
@@ -41,9 +40,11 @@ The "64 bytes cache line" trick is quite unique among all implementations of the
 The L1 cache is well known in the performance-hacking literature as the main bottleneck for any efficient in-memory process. If you want things to go fast, you should flatter your CPU L1 cache.

-We are very lucky the station names are just big enough to fill no more than 64 bytes, with min/max values reduced as 16-bit smallint - resulting in temperature range of -3276.7..+3276.8 which seems fair on our planet according to the IPCC. ;)
+Min/max values will be reduced to 16-bit smallint - resulting in a temperature range of -3276.8..+3276.7, which seems fair on our planet according to the IPCC. ;)

-In our first attempt, the `Station[]` array was in fact not aligned to 64 bytes itself. In fact, the RTL `SetLength()` does not align its data to the item size. So the pointer was aligned by 32 bytes, and any memory access would require filling two L1 cache lines. So we added some manual alignment of the data structure, and got 5% better performance.
+In our first attempt, we stored the name into the `Station[]` array, so that each entry was exactly 64 bytes long. But since `crc32c` is a perfect hash function for our dataset, we can just store the 32-bit hash instead, for higher performance. On Intel/AMD/AARCH64 CPUs, we use hardware opcodes for this crc32c computation.
+
+See https://en.wikipedia.org/wiki/Perfect_hash_function for reference.
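The crc32c in question is the Castagnoli CRC that the SSE4.2 `crc32` opcode computes in hardware. As an illustration (not the entry's actual hardware path), a portable bitwise sketch producing the same values:

```c
#include <stddef.h>
#include <stdint.h>

// Portable bitwise CRC-32C (Castagnoli polynomial 0x1EDC6F41, reflected
// form 0x82F63B78): same checksum as the SSE4.2 crc32 instruction, only slower.
uint32_t crc32c(const void *data, size_t len) {
    const uint8_t *p = (const uint8_t *)data;
    uint32_t crc = 0xFFFFFFFFu;  // standard initial value
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;  // final inversion
}
```

The hardware variants (Intel `_mm_crc32_u8`/`_mm_crc32_u64` intrinsics, AArch64 `CRC32CB`/`CRC32CX`) produce identical values, which is why the hash can be stored and compared across platforms.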

 ## Usage

@@ -75,42 +76,43 @@ On my PC, it takes less than 5 seconds to process the 16GB file with 8/10 thread
 Let's compare `abouchez` with a solid multi-threaded entry using file buffer reads and no memory map (like `sbalazs`), using the `time` command on Linux:

 ```
-ab@dev:~/dev/github/1brc-ObjectPascal/bin$ time ./abouchez measurements.txt -t=10 >resmrel5.txt
+ab@dev:~/dev/github/1brc-ObjectPascal/bin$ time ./abouchez measurements.txt -t=20 >resmormot.txt

-real 0m4,216s
-user 0m38,789s
-sys 0m0,632s
+real 0m2,350s
+user 0m40,165s
+sys 0m0,888s

-ab@dev:~/dev/github/1brc-ObjectPascal/bin$ time ./sbalazs measurements.txt 20 >ressb6.txt
+ab@dev:~/dev/github/1brc-ObjectPascal/bin$ time ./sbalazs measurements.txt 20 >ressb.txt

 real 0m25,330s
 user 6m44,853s
 sys 0m31,167s
 ```
-We used 20 threads for `sbalazs`, and 10 threads for `abouchez`, because it was giving the best results for each program on our PC.
+We used 20 threads for both executables, because that gave the best results for each program on our PC.

 Apart from the obvious global "wall" time reduction (`real` numbers), the raw parsing and data gathering in the threads match the number of threads and the running time (`user` numbers), and no syscall is involved by `abouchez` thanks to the memory mapping of the whole file (`sys` numbers, which contain only memory page faults).

 The `memmap()` feature makes the initial/cold `abouchez` call slower, because it needs to cache all measurements data from file into RAM (I have 32GB of RAM, so the whole data file will remain in memory, as on the benchmark hardware):
 ```
-ab@dev:~/dev/github/1brc-ObjectPascal/bin$ time ./abouchez measurements.txt -t=10 >resmrel4.txt
+ab@dev:~/dev/github/1brc-ObjectPascal/bin$ time ./abouchez measurements.txt -t=20 >resmormot.txt

 real 0m6,042s
 user 0m53,699s
 sys 0m2,941s
 ```
 This is the expected behavior, and will be fine with the benchmark challenge, which ignores the min and max values over its 10 runs. So the first run will just warm up the file into memory.
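The memory-mapping approach described above can be sketched in C for illustration (the entry itself uses *mORMot*'s memory-map wrapper; `map_file` here is a hypothetical helper):

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map a whole file read-only. After this, parsing touches plain memory:
// the kernel pages data in on demand (cold run) or serves it straight from
// the page cache (warm runs), with no read() syscalls during parsing.
static const char *map_file(const char *path, size_t *size) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return NULL; }
    *size = (size_t)st.st_size;
    void *p = mmap(NULL, *size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    return p == MAP_FAILED ? NULL : (const char *)p;
}
```

The page faults that populate the mapping on the cold run are exactly what shows up in the `sys` time above.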

-On my Intel 13h gen processor with E-cores and P-cores, forcing thread to core affinity does not help:
+On my Intel 13th gen processor with E-cores and P-cores, forcing thread-to-core affinity does not make any huge difference (we are within the error margin):
 ```
 ab@dev:~/dev/github/1brc-ObjectPascal/bin$ ./abouchez measurements.txt -t=10 -v
-Processing measurements.txt with 10 threads and affinity=false
+Processing measurements.txt with 20 threads and affinity=false
 result hash=8A6B746A, result length=1139418, stations count=41343, valid utf8=1
-done in 4.25s 3.6 GB/s
+done in 2.36s 6.6 GB/s
+
 ab@dev:~/dev/github/1brc-ObjectPascal/bin$ ./abouchez measurements.txt -t=10 -v -a
-Processing measurements.txt with 10 threads and affinity=true
+Processing measurements.txt with 20 threads and affinity=true
 result hash=8A6B746A, result length=1139418, stations count=41343, valid utf8=1
-done in 4.42s 3.5 GB/s
+done in 2.44s 6.4 GB/s
 ```
 Affinity may help on Ryzen 9, because its Zen 3 architecture is made of 16 identical cores with 32 threads, not this Intel E/P-cores mess. But we will validate that on real hardware - no premature guess!
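On Linux, the kind of per-thread core pinning the `-a` switch performs can be sketched as follows (illustrative C; `pin_to_core` is a hypothetical helper, not the entry's actual code):

```c
#define _GNU_SOURCE  // for cpu_set_t, CPU_ZERO/CPU_SET, sched_setaffinity
#include <sched.h>

// Restrict the calling thread to a single CPU core; returns 0 on success.
// Each worker thread would call this with its own core index.
static int pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof(set), &set);  // 0 = calling thread
}
```

As the text notes, pinning usually loses to the kernel scheduler, which sees the actual system load; this is worth measuring rather than assuming.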

@@ -145,6 +147,8 @@ This `-t=1` run is for fun: it will run the process in a single thread. It will
 Our proposal has been run on the benchmark hardware, using the full automation.

+TO BE COMPLETED - NUMBERS BELOW ARE FOR THE OLD VERSION:
+
 With 30 threads (on a busy system):
 ```
 -- SSD --
@@ -191,7 +195,9 @@ It may be as expected:
 - The Ryzen CPU has 16 cores with 32 threads, and it makes sense that using only the "real" cores with CPU+RAM intensive work is enough to saturate them;
 - It is a known fact from experience that forcing thread affinity is not a good idea: it is always much better to let any modern Linux operating system schedule the threads to the CPU cores, because it has much better knowledge of the actual system load and status, even on a "fair" CPU architecture like AMD Zen.

-## Ending Note
+## Old Version
+
+TO BE DETAILED (WITH NUMBERS?)

 You could disable our tuned asm in the project source code, and lose about 10% by using the general-purpose *mORMot* `crc32c()` and `CompareMem()` functions, which already run SSE2/SSE4.2 tuned assembly.
