entries/abcz/README.md (10 additions & 12 deletions)
@@ -19,17 +19,17 @@ I am very happy to share decades of server-side performance coding techniques us
 Here are the main ideas behind this implementation proposal:

 - **mORMot** makes cross-platform and cross-compiler support simple (e.g. `TMemMap`, `TDynArray.Sort`,`TTextWriter`, `SetThreadCpuAffinity`, `crc32c`, `ConsoleWrite` or command-line parsing);
-- Memory map the entire 16GB file at once (so won't work on 32-bit OS, but reduce syscalls);
+- Will memmap the entire 16GB file at once into memory (so won't work on 32-bit OS, but reduce syscalls);
 - Process file in parallel using several threads (configurable, with `-t=16` by default);
-- Each thread is fed from 64MB chunks of input (because thread scheduling is unfair, it is inefficient to pre-divide the size of the whole input file into the number of threads);
+- Fed each thread from 64MB chunks of input (because thread scheduling is unfair, it is inefficient to pre-divide the size of the whole input file into the number of threads);
 - Each thread manages its own data, so there is no lock until the thread is finished and data is consolidated;
-- Each station information (name and values) is packed into a record of exactly 64 bytes, with no external pointer/string, so match the CPU L1 cache size for efficiency;
+- Each station information (name and values) is packed into a record of exactly 64 bytes, with no external pointer/string, to match the CPU L1 cache size for efficiency;
 - Use a dedicated hash table for the name lookup, with direct crc32c SSE4.2 hash - when `TDynArrayHashed` is involved, it requires a transient name copy on the stack, which is noticeably slower (see last paragraph of this document);
-- Store values as 16-bit or 32-bit integers (temperature multiplied by 10);
+- Store values as 16-bit or 32-bit integers (i.e. temperature multiplied by 10);
 - Parse temperatures with a dedicated code (expects single decimal input values);
 - No memory allocation (e.g. no transient `string` or `TBytes`) nor any syscall is done during the parsing process to reduce contention and ensure the process is only CPU-bound and RAM-bound (we checked this with `strace` on Linux);
-- Pascal code was tuned to generate the best possible asm output on FPC x86_64 (which is our target) with no SIMD involved;
-- Some dedicated x86_64 asm has been written to replace mORMot `crc32c` and `MemCmp` general-purpose functions and gain a last few percents;
+- Pascal code was tuned to generate the best possible asm output on FPC x86_64 (which is our target);
+- Some dedicated x86_64 asm has been written to replace mORMot `crc32c` and `MemCmp` general-purpose functions and gain a last few percents (nice to have);
 - Can optionally output timing statistics and hash value on the console to debug and refine settings (with the `-v` command line switch);
 - Can optionally set each thread affinity to a single core (with the `-a` command line switch).

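For readers of this diff, the two list items about storing temperatures "multiplied by 10" and parsing them with "dedicated code" that expects a single decimal can be illustrated with a short sketch. This is not the entry's Pascal code: it is written in Python for brevity, the name `parse_tenths` is hypothetical, and it assumes the input field has the form `[-]digits.digit` with no validation.

```python
def parse_tenths(field: bytes) -> int:
    """Convert a field such as b"-12.3" into the integer -123 (tenths of a degree).

    Assumption: exactly one digit follows the decimal point and the field
    is well-formed, so no float conversion or error handling is needed.
    """
    pos = 0
    negative = field[0] == ord("-")
    if negative:
        pos = 1
    value = 0
    # Accumulate the integer part digit by digit until the decimal point.
    while field[pos] != ord("."):
        value = value * 10 + (field[pos] - ord("0"))
        pos += 1
    # Fold the single fractional digit into the scaled integer.
    value = value * 10 + (field[pos + 1] - ord("0"))
    return -value if negative else value
```

With this representation, any reading between -99.9 and 99.9 maps to an integer between -999 and 999, which fits a signed 16-bit field; running sums over many readings would need a wider integer.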
@@ -60,11 +60,9 @@ We will use these command-line switches for local (dev PC), and benchmark (chall

 ## Local Analysis

-On my PC, it takes less than 5 seconds to process the 16GB file with 8 threads.
+On my PC, it takes less than 5 seconds to process the 16GB file with 8/10 threads.

-If we use the `time` command on Linux, we can see that there is little time spend in kernel (sys) land.
-
-If we compare our `mormot` with a solid multi-threaded entry using file buffer reads and no memory map (like `sbalazs`):
+Let's compare our `mormot` with a solid multi-threaded entry using file buffer reads and no memory map (like `sbalazs`), using the `time` command on Linux:

 ```
 ab@dev:~/dev/github/1brc-ObjectPascal/bin$ time ./mormot measurements.txt -t=10 >resmrel5.txt
@@ -79,7 +77,7 @@ real 0m25,330s
 user 6m44,853s
 sys 0m31,167s
 ```
-We used 20 threads for `sbalazs`, and 10 threads for `mormot` because it was giving the best results on each entry on this particular PC.
+We used 20 threads for `sbalazs`, and 10 threads for `mormot` because it was giving the best results for each program on our PC.

 Apart from the obvious global "wall" time reduction (`real` numbers), the raw parsing and data gathering in the threads match the number of threads and the running time (`user` numbers), and no syscall is involved by `mormot` thanks to the memory mapping of the whole file (`sys` numbers, which contain only memory page faults).

@@ -125,7 +123,7 @@ On the https://github.com/gcarreno/1brc-ObjectPascal challenge hardware, which i
 ./mormot measurements.txt -v -t=24 -a
 ./mormot measurements.txt -v -t=32 -a
 ```
-Please run those command lines, to guess which parameters are to be run for the benchmark to give the best results on the actual benchmark PC with its Ryzen 9 CPU. We will see if core affinity makes a difference here.
+Please run those command lines, to guess which parameters are to be run for the benchmark, and would give the best results on the actual benchmark PC with its Ryzen 9 CPU. We will see if core affinity makes a difference here.