This requires us to migrate from TFPHashList to the generic TDictionary. Even though TDictionary is slower than TFPHashList, the overall improvements yielded a significant performance gain.
Using TDictionary, the code is more similar to my Delphi version, in case we get to run those tests on a Windows PC.

*expected timing: ~60 seconds, single-threaded*

**ACTUAL TIMING: 58 seconds as per gcarreno**

## Multi-Threaded Attempt (2024-04-10)
Better wait and see the results on the real environment, before judging.

### results

**ACTUAL TIMING: 6.042 seconds as per gcarreno**

Due to the unexpectedly slow performance on Craig Chapman's powerful computer, and since the results above intrigued me, I have ported my FPC code to Delphi, to compare the output of both compilers on Windows x64.
Hopefully, it will help identify whether the issue stems from Windows x64 or FPC in multi-threaded implementations.

## Multi-Threaded attempt v.2 (2024-04-16)
On my Linux setup (FPC, no VM), I realize that as the number of cores increases, the performance improvement is far from linear:
- 1 core: 77 seconds
- 2 cores: 50 seconds
- 4 cores: 35 seconds
In order to identify the source of the problem, I first forced all threads to write to unprotected shared memory (the results are wrong, of course). The idea is to understand whether the problem stems from ~45k records being created for each thread, and whether the retrieval of those records is causing too many cache misses.
With 4 cores, we are now at ~21 seconds, which is much closer to linear performance. To make sure the threads get an approximately balanced load, each thread is initially assigned a range of data, and as soon as it is done with its given range, it requests the next available range. I tried varying range sizes, but the result was always slower than a plain `N / K` distribution.
So the problem (on my computer at least) does not seem to be load imbalance between threads, but rather having to keep swapping all those per-thread records in and out of the cache.
As a last attempt, I tried again accumulating data in shared memory, protecting all data accumulation with `InterlockedInc`, `InterlockedExchangeAdd`, and `TCriticalSection`. To avoid too much contention on the critical section, I also tried maintaining a large array of critical sections, acquiring only the one for the index into which we are accumulating data. All of these attempts under-performed on 4 threads, and will likely perform even worse as the thread count increases. The only way this would work is with finer-grained control over the locking, such that a thread is blocked only if it tries to write into a record that is already locked.
Lastly, `TDictionary.TryGetValue` has proven quite costly, at around `1/4th` of the total cost. And although it is much better than using the station name as the key, evaluating the `mod` of all those hashes produces a lot of collisions. If the dictionary's key storage is implemented as an array, and `mod` is used to map those `CRC32` values into indexes in `[0, 45k]`, those collisions will be a source of slowness. If there is a way to reduce the number of collisions, then a custom dictionary implementation might help.