Skip to content

Commit e825482

Browse files
committed
FPC v2, single-threaded
1 parent 68d70ce commit e825482

File tree

5 files changed

+215
-88
lines changed

5 files changed

+215
-88
lines changed

entries/ghatem-fpc/README.md

Lines changed: 135 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,135 @@
1+
# Georges Hatem
2+
3+
## Requirements
4+
- mORMot2 library
5+
- 64-bit compilation
6+
7+
## Hardware + Environment
8+
host:
9+
- Dell XPS 15 (9560, 2017)
10+
- OS: ArchLinux
11+
- CPU: Intel i7 7700 HQ (4 Cores, 8 Threads @2.80-3.80 GHz, Kaby Lake)
12+
- 32 GB RAM DDR4 (2400 MHz)
13+
- 1 TB SSD NVME
14+
15+
VM (VirtualBox):
16+
- OS: Windows
17+
- CPU count: 4 out of 8 (threads, probably)
18+
- 20 GB RAM
19+
20+
note about the hash:
21+
run with DEBUG compiler directive to write from stream directly to file, otherwise the hash will not match.
22+
23+
## Baseline
24+
the initial implementation (the Delphi baseline found in /baseline) aimed to get a correct output, regardless of performance:
25+
"Make it work, then make it work better".
26+
It turns out even the baseline caused some trouble, namely the `Ceil` implementation was yielding different results between FPC and Delphi (and different results between Delphi Win32/Win64).
27+
After the input of several peers including gcarreno, abouchez and paweld (thanks!), this last detail was ironed out, and the baseline yielded a matching hash.
28+
29+
## Single-Threaded Attempt (2024-04-03)
30+
31+
in this first attempt, the implementation is broken down into 3 major steps:
32+
1. read the input file
33+
2. process it
34+
3. output to file/stdout
35+
36+
a key point:
37+
- the reading / writing (steps 1 an 3) will be done on the main thread.
38+
- the processing (step 2) is where a future submission will attempt to parallelize the work done.
39+
40+
### 1. Read The Input File
41+
42+
#### v1. File Stream
43+
In the baseline implementation, a file stream is used to read line by line, which is far from optimal.
44+
45+
#### v2. Memory-Mapped File
46+
An improvement was to read the whole file into memory in one shot, using memory mapping.
47+
In this implementation, I use `GetFileSize` and `CreateFileMapping`, procedure found online (need to find URL).
48+
First thing to note is, the usable memory by a Win32 process is limited to ~1.5-2 GB of RAM. Exceeding this limit yields an out-of-memory exception, so we must compile for Win64.
49+
Some issues with this implementation (see unit FileReader.pas):
50+
- `GetFileSize` was returning a size of ~3.9B, while we know the real input is ~16B.
51+
- `CreateFileMapping` was taking 2.5 seconds to read the file into a `TArray<Utf8Char>` the first time, and subsequent reads were down to 1.7 seconds. But this is for only a quarter of the input size.
52+
- using `GetFileSizeEx` instead, we now get the real file size, ~16B
53+
- however, `CreateFileMapping` accepts as parameters `Cardinal` types, so a value of ~16B (Int64) would yield a `range check` error.
54+
- if we wanted to move forward with this implementation, we would need to call `CreateFileMapping` in 4 or 5 batches, which would take 1.7 x 5 ~= 8.5 seconds just to read the data.
55+
- attempt aborted, see v3.
56+
57+
#### v3. Memory-Mapped File, Provided by `mORMot2`
58+
A v3 attempt at reading the file was using a ready-made implementation of file memory-mapping, provided by synopse/mORMot, big thanks @abouchez!
59+
The function returns a pAnsiChar and the size as Int64 of the mapped data. Performance-wise, it all happens in under 0.1 seconds, but now we must delve into pointers.
60+
61+
62+
### 2. Process the File
63+
64+
Well, at a glance this is straightforward:
65+
- look for new-line chars to delimit each line, split it to extract StationName / Temperature.
66+
- Decode the temperature into a numerical value (double, or integer x 10)
67+
- accumulate the information into a dictionary of StationName -> Record of data
68+
69+
A few optimizations done here, to the best of my knowledge:
70+
71+
#### For Each line, Iterate Backwards
72+
`Length(StationName) > Length(Temperature)`, so for each line, better look for the `;` starting from the end.
73+
Given the below input:
74+
```
75+
Rock Hill;-54.3
76+
^^^
77+
```
78+
the 3 last characters will be mandatorily present, so we can skip them while iterating.
79+
I tried unrolling the loop over the last 2-3 characters that must be checked, but measuring it, it showed to be slower, don't know why.
80+
81+
#### Extract the Strings Using `SetString`
82+
manual string concatenation and splitting proved to be very slow.
83+
Using `SetString` yielded a noticeable improvement. Remaining in the realm of pointers would probably be much faster, but I haven't ventured there (yet, maybe if time is available).
84+
85+
#### Decode Temperature Into Numerical Value
86+
First attempt was to use `StrToFloat`, which was pretty catastrophic. Using `Val` was still slow, but definitely a step-up. `Val` with SmallInt proved to be faster than with Double, even though there's extra operations to be done.
87+
So now we need to get rid of the `.` character.
88+
89+
Again, string functions being very slow, replicating the last character at length-1, and then reduce the length of the AnsiString using `SetLength` yielded faster results. I tried doing so with pAnsiChar but got some range-check errors. Might get back to it later on.
90+
91+
Finally, assuming temperatures in the range `-100` and `+100`, with 1 decimal figure, there should be 2000 different temperatures possible.
92+
Instead of decoding the same temperature values using `Val`, do it once, store it in a TDictionary (TemperatureString -> TemperatureSmallInt). There were I believe 1998 different temperature values, so we only call `Val` 1998 times instead of 1 billion times. Over an input size of 100M, the gain was 4-5 seconds (total 28s -> 23s)
93+
94+
#### Accumulate Data Into a Dictionary of Records
95+
- the records are packed, with minimal size
96+
- the dictionary maps StationName -> Pointer to record, to avoid passing around full records
97+
- records are pre-allocated in an array of 45,000, instead of allocating them on-the-fly.
98+
- when a station is not found in the dictionary, we point to the next element in the records-array.
99+
- with an input of size 100M, this accumulation step takes a considerable amount of time (9 seconds out of 23 total). I haven't identified yet if it is the `dict.Add` that takes time, the `dict.TryGetValue`, or just generally the dictionary hash collisions. Even though the dictionary is pre-allocated with a capacity of 45,000, but that did not seem to improve much. I also tried the dictionary implementation of Spring4D, but also no improvements.
100+
101+
### 3. Output the Results
102+
Since I started using pointers (pAnsiChar), getting a matching hash was a bit of a pickle:
103+
Some Unicode characters were messed up in their display, or messed up in their ordering.
104+
Eventually, the ordering issue was resolved by using `AnsiStrings.CompareStr` instead of `SysUtils.CompareStr`. This step will clearly remain single-threaded, but takes 0.15 seconds for all 45,000 stations, so it is not a big deal.
105+
106+
### Conclusion
107+
108+
If there are any improvements to be done in this single-threaded version, they would be in the following order, from most impactful to least impactful (performance numbers over an input of 100M, not 1B):
109+
- try improve storing / retrieval from the dictionary of station data, cost: 9 sec / 23 sec
110+
- try improve the extraction of string data, maybe using pointers (6.5 sec / 23 sec)
111+
- try improve the type conversion, though not sure how at this point (4.5 sec / 23 sec)
112+
- somehow, incrementing an integer 1B times is taking 1.2 seconds, while incrementing the main input index (16B times) is only taking 0.5 second. It's just 1.2 seconds, but I'm not understanding why it should behave that way.
113+
114+
115+
## Single-Threaded Attempt (2024-04-09)
116+
117+
Turns out mORMot2 does not compile under Delphi for Linux x64 due to poor Delphi support for Linux.
118+
This implementation was migrated as-is with Lazarus (FPC), which is much slower than Delphi on Windows x64.
119+
120+
**Timing as per gcarreno: 176 seconds, single-threaded**
121+
122+
Since then, a few improvements:
123+
124+
### better temperature conversion to SmallInt
125+
126+
In the previous implementation, we identified `Val` and `StrToFloat` as a source of slowness. @abouchez suggested that the numerical value could be obtained without those.
127+
Indeed, by using Ord of the needed characters, offsetting by the ASCII value of '0', and multiplying by the right power of 10, the temperature can be extracted as integer at a fraction of the cost.
128+
129+
### better usage of dictionary data-structure:
130+
131+
Instead of extracting the station name as a string 1B times, and use it as a dictionary key, instead use the FPC hashing function to reduce the string extraction to ~45k times (once per station), and otherwise use the pointer + length (station_name) to compute the hash.
132+
This requires us to migrate from TFPHashList to the generic TDictionary. Even though TDictionary is slower than TFPHashList, the overall improvements yielded a significant performance gain.
133+
Using TDictionary, the code is more similar to my Delphi version, in case we get to run those tests on a Windows PC.
134+
135+
** expected timing: ~60 seconds, single-threaded**

entries/ghatem-fpc/src/OneBRCproj.lpi

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -50,6 +50,9 @@
5050
<UseExternalDbgSyms Value="True"/>
5151
</Debugging>
5252
</Linking>
53+
<Other>
54+
<CustomOptions Value="-dDEBUG"/>
55+
</Other>
5356
</CompilerOptions>
5457
</Item2>
5558
<Item3 Name="Release">
@@ -76,6 +79,9 @@
7679
</Debugging>
7780
<LinkSmart Value="True"/>
7881
</Linking>
82+
<Other>
83+
<CustomOptions Value="-dRELEASE"/>
84+
</Other>
7985
</CompilerOptions>
8086
</Item3>
8187
</BuildModes>
@@ -121,6 +127,9 @@
121127
<DebugInfoType Value="dsDwarf3"/>
122128
</Debugging>
123129
</Linking>
130+
<Other>
131+
<CustomOptions Value="-dDEBUG"/>
132+
</Other>
124133
</CompilerOptions>
125134
<Debugging>
126135
<Exceptions Count="3">

entries/ghatem-fpc/src/OneBRCproj.lpr

Lines changed: 5 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -83,12 +83,11 @@ procedure TOneBRCApp.DoRun;
8383
vOneBRC := TOneBRC.Create;
8484
try
8585
try
86-
//vOneBRC.mORMotMMF(vFileName);
87-
//vOneBRC.SingleThread;
88-
//vOneBRC.GenerateOutput;
89-
//vOneBRC.Free;
86+
vOneBRC.mORMotMMF(vFileName);
87+
vOneBRC.SingleThread;
88+
vOneBRC.GenerateOutput;
9089

91-
vStart := GetTickCount;
90+
{vStart := GetTickCount;
9291
vOneBRC.mORMotMMF (vFileName);
9392
vTime := GetTickCount - vStart;
9493
WriteLn('read: ' + FloatToStr(vTime / 1000));
@@ -102,15 +101,13 @@ procedure TOneBRCApp.DoRun;
102101
WriteLn('-----------');
103102
WriteLn;
104103
105-
106-
107104
vStart := GetTickCount;
108105
vOneBRC.GenerateOutput;
109106
vTime := GetTickCount - vStart;
110107
WriteLn('generate: ' + FloatToStr(vTime / 1000));
111108
WriteLn('-----------');
112109
WriteLn;
113-
ReadLn;
110+
ReadLn; }
114111
except
115112
on E: Exception do
116113
begin

0 commit comments

Comments
 (0)