Commit 37acbc9

Merge pull request #109 from georges-hatem/main
extra docs. thread count as cmdline param
2 parents 7770fea + 4931ddb commit 37acbc9

4 files changed: 39 additions, 9 deletions

entries/ghatem-fpc/README.md

Lines changed: 23 additions & 4 deletions
@@ -132,8 +132,9 @@ Instead of extracting the station name as a string 1B times, and use it as a dic
 This requires us to migrate from TFPHashList to the generic TDictionary. Even though TDictionary is slower than TFPHashList, the overall improvements yielded a significant performance gain.
 Using TDictionary, the code is more similar to my Delphi version, in case we get to run those tests on a Windows PC.
 
-* expected timing: ~60 seconds, single-threaded*
-** ACTUAL TIMING: 58 seconds as per gcarreno **
+*expected timing: ~60 seconds, single-threaded*
+
+**ACTUAL TIMING: 58 seconds as per gcarreno**
 
 
 ## Multi-Threaded Attempt (2024-04-10)
@@ -184,7 +185,25 @@ Better wait and see the results on the real environment, before judging.
 
 ### results
 
-** ACTUAL TIMING: 6.042 seconds as per gcarreno **
+**ACTUAL TIMING: 6.042 seconds as per gcarreno**
 
 Due to the unexpectedly slow performance on Craig Chapman's powerful computer, and since the results above intrigued me, I have ported my FPC code onto Delphi to be able to compare the output of both compilers on Windows x64.
-Hopefully, it will help identify if the issue stems from Windows x64 or FPC, in multi-threaded implementations.
+Hopefully, it will help identify if the issue stems from Windows x64 or FPC, in multi-threaded implementations.
+
+## Multi-Threaded attempt v.2 (2024-04-16)
+
+On my Linux setup (FPC, no VM), I noticed that as the number of cores increases, the performance improvement is far from linear:
+
+- 1 core: 77 seconds
+- 2 cores: 50 seconds
+- 4 cores: 35 seconds
+
+In order to identify the source of the problem, I first forced all threads to write to unprotected shared memory (the results are wrong, of course). The idea is to understand whether the problem stems from ~45k records being created for each thread, and whether retrieving those records causes too many cache misses.
+
+With 4 cores, we're now at ~21 seconds, which is much closer to linear performance. In order to make sure the threads are getting an approximately balanced load, each thread is initially assigned a range of data and, as soon as it is done with that range, requests the next available range. I've tried varying range sizes, but the result was always slower than just a plain `N / K` distribution.
+
+So the problem (on my computer at least) does not seem to be related to load imbalance between threads, but rather to having to swap all those records in and out of the cache.
+
+As a last attempt, I tried again accumulating data in shared memory, protecting all data accumulation with `InterlockedInc`, `InterlockedExchangeAdd`, and `TCriticalSection`. In order to avoid too much contention on the critical section, I also tried to maintain a large array of critical sections, acquiring only the one at the index for which we are accumulating data. All of these attempts under-performed on 4 threads, and will likely perform even worse as the thread count increases. The only way this would work is by having finer-grained control over the locking, such that a thread would only be blocked if it tried to write into a record that is already locked.
+
+Lastly, `TDictionary.TryGetValue` has proven to be quite costly, around `1/4th` of the total cost. Although it is currently much better than when using the station name as key, there are a lot of collisions when evaluating the `mod` of all those hashes. So if the dictionary key-storage is implemented as an array, and `mod` is used to transform those `CRC32` values into indexes ranging in `[0, 45k]`, those collisions will be the cause of slowness. If there is a way to reduce the number of collisions, then maybe a custom dictionary implementation might help.
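
The timings reported in the README change above can be restated as speedup and parallel efficiency. This is only arithmetic on the quoted measurements; the symbols $S(p) = T_1 / T_p$ and $E(p) = S(p) / p$ are introduced here for illustration and do not appear in the commit.

```latex
\begin{align*}
S(2) &= \frac{77}{50} = 1.54, & E(2) &= \frac{1.54}{2} = 0.77 \\
S(4) &= \frac{77}{35} = 2.20, & E(4) &= \frac{2.20}{4} = 0.55 \\
S_{\text{unprotected}}(4) &= \frac{77}{21} \approx 3.67, & E_{\text{unprotected}}(4) &\approx 0.92
\end{align*}
```

Perfectly linear scaling on 4 cores would give 77 / 4 = 19.25 seconds, so the ~21 second run (whose results are wrong, since the shared memory is unprotected) sits within about 2 seconds of the ideal, while the correct 35 second run is almost 16 seconds away from it.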

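A rough estimate of how many collisions the `mod` mapping described in the README change could produce. This is a sketch under two assumptions that the commit does not confirm: that the `CRC32` values behave like uniform random numbers, and that $n$ keys are mapped into $m$ slots with $n \approx m \approx 45{,}000$. The symbols $n$, $m$ and $C$ are introduced here, and the actual slot count used by `TDictionary` may differ.

```latex
\begin{align*}
\text{occupied slots} &= m\left(1 - \left(1 - \tfrac{1}{m}\right)^{n}\right) \approx m\left(1 - e^{-n/m}\right) \\
C &= n - m\left(1 - e^{-n/m}\right) \\
n = m = 45{,}000: \quad C &\approx 45{,}000 \cdot e^{-1} \approx 16{,}555 \quad (\approx 37\% \text{ of the keys})
\end{align*}
```

Here $C$ is the expected number of keys that land in a slot already taken by another key. Under the same assumptions, doubling the slot count to $m = 90{,}000$ would lower $C$ to about 9,600 (roughly 21% of the keys), which is one way a custom dictionary could trade memory for fewer collisions.
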
entries/ghatem-fpc/src/OneBRCproj.lpr

Lines changed: 10 additions & 2 deletions
@@ -17,6 +17,7 @@
   TOneBRCApp = class(TCustomApplication)
   private
     FFileName: string;
+    FThreadCount: Integer;
     procedure RunOneBRC;
   protected
     procedure DoRun; override;
@@ -34,7 +35,7 @@ procedure TOneBRCApp.RunOneBRC;
   vStart: Int64;
   vTime: Int64;
 begin
-  vOneBRC := TOneBRC.Create (32);
+  vOneBRC := TOneBRC.Create (FThreadCount);
   try
     try
       vOneBRC.mORMotMMF(FFileName);
@@ -89,13 +90,15 @@ procedure TOneBRCApp.DoRun;
   ErrorMsg: String;
 begin
   // quick check parameters
-  ErrorMsg:= CheckOptions(Format('%s%s%s:',[
+  ErrorMsg:= CheckOptions(Format('%s%s%s%s:',[
       cShortOptHelp,
+      cShortOptThread,
       cShortOptVersion,
       cShortOptInput
     ]),
     [
       cLongOptHelp,
+      cLongOptThread+':',
       cLongOptVersion,
       cLongOptInput+':'
     ]
@@ -120,6 +123,11 @@ procedure TOneBRCApp.DoRun;
     Exit;
   end;
 
+  FThreadCount := 32;
+  if HasOption(cShortOptThread, cLongOptThread) then begin
+    FThreadCount := StrToInt (GetOptionValue(cShortOptThread, cLongOptThread));
+  end;
+
   if HasOption(cShortOptInput, cLongOptInput) then begin
     FFileName := GetOptionValue(
       cShortOptInput,

entries/ghatem-fpc/src/baseline.console.pas

Lines changed: 3 additions & 0 deletions
@@ -19,6 +19,8 @@ interface
   cLongOptVersion = 'version';
   cShortOptInput: Char = 'i';
   cLongOptInput = 'input-file';
+  cShortOptThread: Char = 't';
+  cLongOptThread = 'threads';
 {$ELSE}
   cOptionHelp: array of string = ['-h', '--help'];
   cOptionVersion: array of string = ['-v', '--version'];
@@ -63,6 +65,7 @@ procedure WriteHelp;
   WriteLn(' -h|--help Writes this help message and exits');
   WriteLn(' -v|--version Writes the version and exits');
   WriteLn(' -i|--input-file <filename> The file containing the Weather Stations');
+  WriteLn(' -t|--threads <threadcount> The number of threads to be used, default 32');
 end;
 
 {$IFNDEF FPC}

entries/ghatem-fpc/src/onebrc.pas

Lines changed: 3 additions & 3 deletions
@@ -135,14 +135,14 @@ procedure TOneBRC.ExtractLineData(const aStart: Int64; const aEnd: Int64; out aL
   aTemp := (Ord(FData[aEnd]) - c0ascii)
          + 10 *(Ord(FData[aEnd-2]) - c0ascii);
   vDigit := Ord(FData[aEnd-3]);
-  if (vDigit >= c0ascii) and (vDigit <= c9ascii) then begin
+  if vDigit >= c0ascii then begin
     aTemp := aTemp + 100*(Ord(FData[aEnd-3]) - c0ascii);
     vDigit := Ord(FData[aEnd-4]);
     if vDigit = cNegAscii then
-      aTemp := -1 * aTemp;
+      aTemp := -aTemp;
   end
   else if vDigit = cNegAscii then
-    aTemp := -1 * aTemp;
+    aTemp := -aTemp;
 end;
 
 //---------------------------------------------------
