# Georges Hatem

## Requirements
 - mORMot2 library
 - 64-bit compilation

## Hardware + Environment
Host:
 - Dell XPS 15 (9560, 2017)
 - OS: Arch Linux
 - CPU: Intel i7-7700HQ (4 cores, 8 threads @ 2.80-3.80 GHz, Kaby Lake)
 - 32 GB DDR4 RAM (2400 MHz)
 - 1 TB NVMe SSD

VM (VirtualBox):
 - OS: Windows
 - CPU count: 4 out of 8 (threads, probably)
 - 20 GB RAM

Note about the hash: run with the DEBUG compiler directive to write from the stream directly to file, otherwise the hash will not match.

## Baseline
The initial implementation (the Delphi baseline found in /baseline) aimed to get correct output, regardless of performance:
"Make it work, then make it work better".
It turns out even the baseline caused some trouble: the `Ceil` implementation was yielding different results between FPC and Delphi (and between Delphi Win32 and Win64).
After input from several peers, including gcarreno, abouchez and paweld (thanks!), this last detail was ironed out and the baseline yielded a matching hash.

## Single-Threaded Attempt (2024-04-03)

In this first attempt, the implementation is broken down into 3 major steps:
 1. read the input file
 2. process it
 3. output to file/stdout

Key points:
 - the reading/writing (steps 1 and 3) is done on the main thread.
 - the processing (step 2) is where a future submission will attempt to parallelize the work.

### 1. Read the Input File

#### v1. File Stream
In the baseline implementation, a file stream is used to read line by line, which is far from optimal.

#### v2. Memory-Mapped File
An improvement was to read the whole file into memory in one shot, using memory mapping.
In this implementation, I use `GetFileSize` and `CreateFileMapping`, a procedure found online (need to find URL).
First thing to note: the usable memory of a Win32 process is limited to ~1.5-2 GB of RAM. Exceeding this limit yields an out-of-memory exception, so we must compile for Win64.
Some issues with this implementation (see unit FileReader.pas):
 - `GetFileSize` was returning a size of ~3.9B (billion bytes), while we know the real input is ~16B.
 - `CreateFileMapping` was taking 2.5 seconds to read the file into a `TArray<Utf8Char>` the first time; subsequent reads were down to 1.7 seconds. But this is for only a quarter of the input size.
 - using `GetFileSizeEx` instead, we now get the real file size, ~16B.
 - however, `CreateFileMapping` accepts `Cardinal` parameters, so a value of ~16B (an Int64) would yield a `range check` error.
 - if we wanted to move forward with this implementation, we would need to call `CreateFileMapping` in 4 or 5 batches, which would take 1.7 x 5 ~= 8.5 seconds just to read the data.
 - attempt aborted; see v3.

#### v3. Memory-Mapped File, Provided by `mORMot2`
A third attempt at reading the file used a ready-made implementation of file memory-mapping, provided by synopse/mORMot2 (big thanks @abouchez!).
The function returns a `PAnsiChar` and the size (as `Int64`) of the mapped data. Performance-wise, it all happens in under 0.1 seconds, but now we must delve into pointers.
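As a minimal sketch, usage could look like the following, assuming mORMot2's `TMemoryMap` record from `mormot.core.os` (exact member names should be checked against the installed mORMot2 version):

```pascal
uses
  SysUtils,
  mormot.core.os; // assumed unit providing TMemoryMap

procedure ProcessMappedFile(const aFileName: TFileName);
var
  Map: TMemoryMap;
  Data: PAnsiChar;
  DataSize: Int64;
begin
  if not Map.Map(aFileName) then
    raise Exception.CreateFmt('cannot map %s', [aFileName]);
  try
    Data := Map.Buffer;   // first byte of the mapped file
    DataSize := Map.Size; // the full size fits in Int64, even at ~16 GB
    // ... scan Data[0 .. DataSize - 1] here ...
  finally
    Map.UnMap; // release the mapping
  end;
end;
```

The whole file is visible through one pointer, so there is no batching and no `Cardinal` size limit as with the raw Win32 calls.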


### 2. Process the File

At a glance, this is straightforward:
 - look for new-line chars to delimit each line, then split it to extract StationName / Temperature
 - decode the temperature into a numerical value (double, or integer x 10)
 - accumulate the information into a dictionary of StationName -> record of data

A few optimizations were done here, to the best of my knowledge:

#### For Each Line, Iterate Backwards
`Length(StationName) > Length(Temperature)`, so for each line it is better to look for the `;` starting from the end.
Given the input below:
```
Rock Hill;-54.3
            ^^^
```
the last 3 characters will always be present, so we can skip them while iterating.
I tried unrolling the loop over the last 2-3 characters that must be checked, but measurements showed it to be slower; I don't know why.
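The backward scan can be sketched as follows; the names are hypothetical, and the sketch assumes `Line` points at the first character of a line and `LineLen` is its length without the newline:

```pascal
// The temperature always ends in "digit, '.', digit", so the search for ';'
// can start 4 characters before the end of the line.
function FindSemicolon(Line: PAnsiChar; LineLen: Integer): Integer;
var
  i: Integer;
begin
  i := LineLen - 4; // skip the guaranteed 3-character tail
  while (i >= 0) and (Line[i] <> ';') do
    Dec(i);
  Result := i; // index of ';', or -1 if not found
end;
```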

#### Extract the Strings Using `SetString`
Manual string concatenation and splitting proved to be very slow.
Using `SetString` yielded a noticeable improvement. Remaining in the realm of pointers would probably be much faster, but I haven't ventured there (yet; maybe if time allows).
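As a sketch, extracting the station name from the mapped buffer with `SetString` looks like this (procedure and parameter names hypothetical):

```pascal
// SetString allocates the AnsiString once and copies the bytes in a single
// move, instead of growing the string character by character.
procedure ExtractStationName(Line: PAnsiChar; SemicolonPos: Integer;
  out StationName: AnsiString);
begin
  // copies Line[0 .. SemicolonPos - 1] into StationName
  SetString(StationName, Line, SemicolonPos);
end;
```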

#### Decode the Temperature Into a Numerical Value
The first attempt was to use `StrToFloat`, which was pretty catastrophic. Using `Val` was still slow, but definitely a step up. `Val` with SmallInt proved to be faster than with Double, even though extra operations are needed.
So now we need to get rid of the `.` character.

Again, with string functions being very slow, replicating the last character at position length-1 and then reducing the length of the AnsiString using `SetLength` yielded faster results. I tried doing the same with PAnsiChar but got some range-check errors. Might get back to it later on.

Finally, assuming temperatures between `-100` and `+100` with 1 decimal figure, there should be about 2000 possible temperature values.
Instead of decoding the same temperature values with `Val` over and over, decode each one once and store it in a TDictionary (TemperatureString -> TemperatureSmallInt). There were, I believe, 1998 distinct temperature values, so we only call `Val` 1998 times instead of 1 billion times. Over an input size of 100M, the gain was 4-5 seconds (total 28s -> 23s).
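A sketch of the combined trick (dot removal via `SetLength` plus the `Val` cache); the names are hypothetical, and `TempCache` is assumed to be created and freed elsewhere:

```pascal
// Decode each distinct temperature string once with Val, then reuse the
// cached SmallInt (value x 10) for every later occurrence.
uses
  SysUtils, Generics.Collections;

var
  TempCache: TDictionary<AnsiString, SmallInt>;

function DecodeTemperature(const S: AnsiString): SmallInt;
var
  NoDot: AnsiString;
  Code: Integer;
begin
  if TempCache.TryGetValue(S, Result) then
    Exit;
  // '-54.3' -> overwrite '.' with the last char -> '-5433' -> trim -> '-543'
  NoDot := S;
  NoDot[Length(NoDot) - 1] := NoDot[Length(NoDot)];
  SetLength(NoDot, Length(NoDot) - 1);
  Val(string(NoDot), Result, Code);
  TempCache.Add(S, Result);
end;
```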

#### Accumulate Data Into a Dictionary of Records
 - the records are packed, with minimal size
 - the dictionary maps StationName -> pointer to record, to avoid passing around full records
 - records are pre-allocated in an array of 45,000, instead of being allocated on the fly
 - when a station is not found in the dictionary, we point to the next element in the records array
 - with an input size of 100M, this accumulation step takes a considerable amount of time (9 seconds out of 23 total). I haven't yet identified whether it is `dict.Add` that takes time, or `dict.TryGetValue`, or just general dictionary hash collisions. The dictionary is pre-allocated with a capacity of 45,000, but that did not seem to improve much. I also tried the dictionary implementation from Spring4D, but saw no improvement either.
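The pool-plus-pointers layout described above can be sketched as follows (record fields and names are hypothetical; `Dict` is assumed to be a `TDictionary` from `Generics.Collections`, created elsewhere):

```pascal
type
  PStationData = ^TStationData;
  TStationData = packed record
    Min, Max: SmallInt; // temperatures x 10
    Sum: Int64;         // running sum of temperatures x 10
    Count: Cardinal;
  end;

var
  Stations: array[0..44999] of TStationData; // pre-allocated pool
  StationCount: Integer = 0;
  Dict: TDictionary<AnsiString, PStationData>;

procedure Accumulate(const Name: AnsiString; Temp: SmallInt);
var
  P: PStationData;
begin
  if not Dict.TryGetValue(Name, P) then
  begin
    P := @Stations[StationCount]; // take the next free slot in the pool
    Inc(StationCount);
    P^.Min := High(SmallInt);
    P^.Max := Low(SmallInt);
    Dict.Add(Name, P);
  end;
  if Temp < P^.Min then P^.Min := Temp;
  if Temp > P^.Max then P^.Max := Temp;
  Inc(P^.Sum, Temp);
  Inc(P^.Count);
end;
```

Since the dictionary only ever holds pointers, no record is copied or heap-allocated per lookup.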

### 3. Output the Results
Since I started using pointers (PAnsiChar), getting a matching hash was a bit of a pickle:
some Unicode characters were messed up in their display, or messed up in their ordering.
Eventually, the ordering issue was resolved by using `AnsiStrings.CompareStr` instead of `SysUtils.CompareStr`. This step will clearly remain single-threaded, but it takes 0.15 seconds for all 45,000 stations, so it is not a big deal.
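As a hypothetical fragment (assuming the station names are collected in a `TList<AnsiString>` named `StationNames`), the ordering fix looks like:

```pascal
// AnsiStrings.CompareStr compares raw bytes, whereas the UnicodeString
// SysUtils.CompareStr compares UTF-16 code units, which can order
// non-ASCII station names differently.
uses
  AnsiStrings, Generics.Collections, Generics.Defaults;

procedure SortStations(StationNames: TList<AnsiString>);
begin
  StationNames.Sort(TComparer<AnsiString>.Construct(
    function(const L, R: AnsiString): Integer
    begin
      Result := AnsiStrings.CompareStr(L, R);
    end));
end;
```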

### Conclusion

If there are any improvements to be made in this single-threaded version, they would come in the following order, from most impactful to least (performance numbers over an input of 100M, not 1B):
 - try to improve storage in / retrieval from the dictionary of station data (cost: 9 sec / 23 sec)
 - try to improve the extraction of string data, maybe using pointers (6.5 sec / 23 sec)
 - try to improve the type conversion, though I am not sure how at this point (4.5 sec / 23 sec)
 - somehow, incrementing an integer 1B times takes 1.2 seconds, while incrementing the main input index (16B times) takes only 0.5 seconds. It is just 1.2 seconds, but I do not understand why it behaves that way.


## Single-Threaded Attempt (2024-04-09)

It turns out mORMot2 does not compile under Delphi for Linux x64, due to poor Delphi support for Linux.
This implementation was migrated as-is to Lazarus (FPC), which is much slower than Delphi on Windows x64.

**Timing as per gcarreno: 176 seconds, single-threaded**

Since then, a few improvements:

### Better Temperature Conversion to SmallInt

In the previous implementation, we identified `Val` and `StrToFloat` as a source of slowness. @abouchez suggested that the numerical value could be obtained without them.
Indeed, by taking `Ord` of the needed characters, offsetting by the ASCII value of '0', and multiplying by the right power of 10, the temperature can be extracted as an integer at a fraction of the cost.
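A sketch of this character-arithmetic decode (function name hypothetical), assuming the temperature text matches `d.d`, `dd.d`, `-d.d` or `-dd.d` and returning the value x 10:

```pascal
function DecodeTemp(P: PAnsiChar; Len: Integer): SmallInt;
var
  Neg: Boolean;
begin
  Neg := P^ = '-';
  if Neg then
  begin
    Inc(P);
    Dec(Len);
  end;
  if Len = 3 then // "d.d": digit, '.', digit
    Result := (Ord(P[0]) - Ord('0')) * 10 + (Ord(P[2]) - Ord('0'))
  else            // "dd.d"
    Result := (Ord(P[0]) - Ord('0')) * 100 + (Ord(P[1]) - Ord('0')) * 10
              + (Ord(P[3]) - Ord('0'));
  if Neg then
    Result := -Result;
end;
```

For example, `'-54.3'` decodes to `-543` without any string allocation or `Val` call.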

### Better Usage of the Dictionary Data Structure

Instead of extracting the station name as a string 1B times and using it as a dictionary key, use the FPC hashing function to reduce the string extraction to ~45k times (once per station), and otherwise use the pointer + length of the station name to compute the hash.
This requires migrating from TFPHashList to the generic TDictionary. Even though TDictionary is slower than TFPHashList, the overall changes yielded a significant performance gain.
With TDictionary, the code is also more similar to my Delphi version, in case we get to run those tests on a Windows PC.
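The idea can be sketched as follows; the real code reuses FPC's own string hash, so the FNV-1a below is only a hypothetical stand-in that shows hashing straight from the mapped buffer:

```pascal
// Hash the station name in place (pointer + length), without materializing
// an AnsiString. The string key is only extracted on a miss, i.e. ~45k
// times instead of 1B times.
function HashName(P: PAnsiChar; Len: Integer): Cardinal;
var
  i: Integer;
begin
  Result := 2166136261;
  for i := 0 to Len - 1 do
  begin
    Result := Result xor Ord(P[i]);
    Result := Result * 16777619; // FNV-1a prime; wraps around in Cardinal
  end;
end;
```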

**Expected timing: ~60 seconds, single-threaded**