Commit 6614399

Merge branch 'gcarreno:main' into main
2 parents 6570fde + ed58662 commit 6614399

7 files changed

+123
-39
lines changed

README.md

Lines changed: 2 additions & 4 deletions
@@ -170,16 +170,14 @@ These are the results from running all entries into the challenge on my personal
 - 250GB SSD
 - 1TB HDD
 
-| # | Result (m:s.ms): SSD | Compiler | Submitter | Notes | Certificates |
-|--:|---------------------:|---------------------:|:---------|:--------------|:----------|:-------------|
+| # | Result (m:s.ms) | Compiler | Submitter | Notes | Certificates |
+|--:|----------------:|---------:|:----------|:------|:-------------|
 | 1 | 0:2.472 | lazarus-3.0, fpc-3.2.2 | Arnaud Bouchez | Using 16 threads | |
 | 2 | 0:16.874 | lazarus-3.0, fpc-3.2.2 | Székely Balázs | Using 16 threads | |
 | 3 | 0:20.046 | lazarus-3.0, fpc-3.2.2 | Lurendrejer Aksen | using 30 thread | |
 | 4 | 1:16.059 | lazarus-3.0, fpc-3.2.2 | Richard Lawson | Using 1 thread | |
 | 5 | 12:40.179 | lazarus-3.0, fpc-3.2.2 | Iwan Kelaiah | Using 1 thread | |
 
-\* : Having issues with Linux watchdog killing the shell process
-
 > ** NOTE **
 >
 > After some tests performed by @paweld, it makes no sense to have an `HDD` run.

entries/abouchez/README.md

Lines changed: 58 additions & 4 deletions
@@ -35,7 +35,15 @@ Here are the main ideas behind this implementation proposal:
 - Can optionally output timing statistics and hash value on the console to debug and refine settings (with the `-v` command line switch);
 - Can optionally set each thread affinity to a single core (with the `-a` command line switch).
 
-The "64 bytes cache line" trick is quite unique among all implementations of the "1brc" I have seen in any language - and it does make a noticeable difference in performance. The L1 cache is well known to be the main bottleneck for any efficient in-memory process. We are very lucky the station names are just big enough to fill no more than 64 bytes, with min/max values reduced as 16-bit smallint - resulting in temperature range of -3276.7..+3276.8 which seems fair on our planet according to the IPCC. ;)
+## Why L1 Cache Matters
+
+The "64 bytes cache line" trick is quite unique among all implementations of the "1brc" I have seen in any language - and it does make a noticeable difference in performance.
+
+The L1 cache is well known in the performance hacking literature to be the main bottleneck for any efficient in-memory process. If you want things to go fast, you should flatter your CPU L1 cache.
+
+We are very lucky the station names are just big enough to fill no more than 64 bytes, with min/max values reduced to 16-bit smallint - resulting in a temperature range of -3276.8..+3276.7 which seems fair on our planet according to the IPCC. ;)
+
+In our first attempt, the `Station[]` array was not itself aligned to 64 bytes, because the RTL `SetLength()` does not align its data to the item size. The pointer was only aligned to 32 bytes, so any memory access would require filling two L1 cache lines. We therefore added some manual alignment of the data structure, and got 5% better performance.
 
 ## Usage
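The manual alignment described in this hunk can be illustrated with a small, self-contained Free Pascal sketch. It is not the code of the commit: `TCacheLineItem`, `AlignUp64` and `COUNT` are hypothetical names, and the 63 spare bytes of over-allocation are an assumption made here so that the aligned block always fits inside the buffer.

```pascal
program AlignSketch;
{$mode objfpc}

const
  COUNT = 1000; // hypothetical number of items

type
  // hypothetical 64-byte record standing in for one station entry
  TCacheLineItem = packed record
    Data: array[0..63] of byte;
  end;
  PCacheLineItem = ^TCacheLineItem;

// round an address up to the next multiple of 64 (unchanged if already aligned)
function AlignUp64(addr: PtrUInt): PtrUInt;
begin
  result := (addr + 63) and not PtrUInt(63);
end;

var
  mem: array of byte;
  items: PCacheLineItem;
begin
  // over-allocate by 63 bytes so that COUNT aligned items always fit
  SetLength(mem, COUNT * SizeOf(TCacheLineItem) + 63);
  items := PCacheLineItem(AlignUp64(PtrUInt(@mem[0])));
  WriteLn('item size    : ', SizeOf(TCacheLineItem));
  WriteLn('misalignment : ', PtrUInt(items) and 63); // 0 once aligned
end.
```

With the pointer aligned this way, each 64-byte item starts on a cache line boundary, so reading one item should touch a single L1 cache line instead of two.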

@@ -133,14 +141,60 @@ time ./abouchez measurements.txt -v -t=1
 ```
 This `-t=1` run is for fun: it will run the process in a single thread. It will help to guess how optimized (and lockfree) our parsing code is, and to validate the CPU multi-core abilities. In a perfect world, other `-t=##` runs should stand for a perfect division of `real` time per the number of working threads, and the `user` value reported by `time` should remain almost the same when we add threads up to the number of CPU cores.
 
-## Feedback Needed
+## Back To Reality
+
+Our proposal has been run on the benchmark hardware, using the full automation.
+
+With 30 threads (on a busy system):
+```
+-- SSD --
+Benchmark 1: abouchez
+Time (mean ± σ): 3.634 s ± 0.099 s [User: 86.580 s, System: 2.012 s]
+Range (min … max): 3.530 s … 3.834 s 10 runs
+
+-- HDD --
+Benchmark 1: abouchez
+Time (mean ± σ): 3.629 s ± 0.102 s [User: 86.086 s, System: 2.008 s]
+Range (min … max): 3.497 s … 3.789 s 10 runs
+```
+
+Later on, only the SSD values are shown, because the HDD version triggered the systemd watchdog, which killed the shell and its benchmark executable. But we can see that once the data is loaded from disk into the RAM cache, there is no difference between a `memmap` file on SSD and on HDD. Linux is a great Operating System for sure.
+
+With 24 threads:
+```
+-- SSD --
+Benchmark 1: abouchez
+Time (mean ± σ): 2.977 s ± 0.053 s [User: 53.790 s, System: 1.881 s]
+Range (min … max): 2.905 s … 3.060 s 10 runs
+```
+
+With 16 threads:
+```
+-- SSD --
+Benchmark 1: abouchez
+Time (mean ± σ): 2.472 s ± 0.061 s [User: 27.787 s, System: 1.720 s]
+Range (min … max): 2.386 s … 2.588 s 10 runs
+```
 
-Here we will put some additional information, once our proposal has been run on the benchmark hardware.
+With 16 threads and thread affinity (`-a` switch on command line):
+```
+-- SSD --
+Benchmark 1: abouchez
+Time (mean ± σ): 3.227 s ± 0.017 s [User: 39.731 s, System: 1.875 s]
+Range (min … max): 3.206 s … 3.253 s 10 runs
+```
+
+So it seems that we should just run the benchmark with the `-t=16` option.
 
-Stay tuned!
+It may be as expected:
+
+- The Ryzen CPU has 16 cores with 32 threads, and it makes sense that using only the "real" cores with CPU+RAM intensive work is enough to saturate them;
+- It is a known fact from experiment that forcing thread affinity is not a good idea, and it is always much better to let any modern Linux Operating System schedule the threads to the CPU cores, because it has a much better knowledge of the actual system load and status. Even on a "fair" CPU architecture like AMD Zen.
 
 ## Ending Note
 
+You could disable our tuned asm in the project source code, and lose about 10% by using general purpose *mORMot* `crc32c()` and `CompareMem()` functions, which already run SSE2/SSE4.2 tuned assembly.
+
 There is a "*pure mORMot*" name lookup version available if you undefine the `CUSTOMHASH` conditional, which is around 40% slower, because it needs to copy the name into the stack before using `TDynArrayHashed`, and has a little more overhead.
 
 Arnaud :D

entries/abouchez/src/brcmormot.lpr

Lines changed: 18 additions & 10 deletions
@@ -5,7 +5,7 @@
 // a dedicated hash table is 40% faster than mORMot generic TDynArrayHashed
 
 {$define CUSTOMASM}
-// a few % faster with some dedicated asm instead of mORMot code on x86_64
+// about 10% faster with some dedicated asm instead of mORMot code on x86_64
 
 {$I mormot.defines.inc}
 
@@ -101,14 +101,14 @@ procedure TBrcList.Init(max: integer; align: boolean);
 SetLength(StationMem, max); // RTL won't align by 64 bytes
 Station := pointer(StationMem);
 if align then
-while PtrUInt(Station) and 63 <> 0 do // manual alignment
+while {%H-}PtrUInt(Station) and 63 <> 0 do // manual alignment
 inc(PByte(Station));
 SetLength(StationHash, HASHSIZE);
 end;
 
 {$ifdef CUSTOMASM}
 
-function crc32c(buf: PAnsiChar; len: cardinal): PtrUInt; nostackframe; assembler;
+function dohash(buf: PAnsiChar; len: cardinal): PtrUInt; nostackframe; assembler;
 asm
 xor eax, eax // it is enough to hash up to 15 bytes for our purpose
 mov ecx, len
@@ -130,7 +130,7 @@ function crc32c(buf: PAnsiChar; len: cardinal): PtrUInt; nostackframe; assembler
 @z:
 end;
 
-function MemEqual(a, b: pointer; len: PtrInt): integer; nostackframe; assembler;
+function CompareMem(a, b: pointer; len: PtrInt): boolean; nostackframe; assembler;
 asm
 add a, len
 add b, len
@@ -164,9 +164,18 @@ function MemEqual(a, b: pointer; len: PtrInt): integer; nostackframe; assembler;
 mov al, byte ptr [a + len]
 cmp al, byte ptr [b + len]
 je @eq
-@diff: mov eax, 1
+@diff: xor eax, eax
 ret
-@eq: xor eax, eax // 0 = found (most common case of no hash collision)
+@eq: mov eax, 1 // = found (most common case of no hash collision)
+end;
+
+{$else}
+
+function dohash(buf: PAnsiChar; len: cardinal): PtrUInt; inline;
+begin
+if len > 16 then
+len := 16; // it is enough to hash up to 16 bytes for our purpose
+result := DefaultHasher(0, buf, len); // fast mORMot asm hasher (crc32c)
 end;
 
 {$endif CUSTOMASM}
@@ -176,16 +185,15 @@ function TBrcList.Search(name: pointer; namelen: PtrInt): PBrcStation;
 h, x: PtrUInt;
 begin
 assert(namelen <= SizeOf(TBrcStation.NameText));
-h := crc32c({$ifndef CUSTOMASM} 0, {$endif} name, namelen);
+h := dohash(name, namelen);
 repeat
 h := h and (HASHSIZE - 1);
 x := StationHash[h];
 if x = 0 then
 break; // void slot
 result := @Station[x - 1];
 if (result^.NameLen = namelen) and
-({$ifdef CUSTOMASM}MemEqual{$else}MemCmp{$endif}(
-@result^.NameText, name, namelen) = 0) then
+CompareMem(@result^.NameText, name, namelen) then
 exit; // found
 inc(h); // hash collision: try next slot
 until false;
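The probing loop in `TBrcList.Search` above follows the usual open addressing pattern: mask the hash, stop on a void slot, compare, and move to the next slot on collision. A minimal standalone sketch of that pattern is given below. It is only an illustration: `ToyHash`, `FindOrAdd`, `Names` and `Slots` are hypothetical names, the table holds plain strings, and the toy hash is not the crc32c-based `dohash` of the commit.

```pascal
program ProbeSketch;
{$mode objfpc}{$H+}

const
  HASHSIZE = 16; // power of two, so "and (HASHSIZE - 1)" acts as a modulo

var
  Names: array of string;                     // entries, in insertion order
  Slots: array[0 .. HASHSIZE - 1] of integer; // 0 = void slot, else index + 1

// toy hash for illustration only (not the hash used by the commit)
function ToyHash(const s: string): PtrUInt;
var
  i: integer;
begin
  result := 0;
  for i := 1 to length(s) do
    result := result * 31 + ord(s[i]);
end;

// return the index of s in Names, adding it first if it is not known yet
// (this sketch never grows the table: keep fewer than HASHSIZE entries)
function FindOrAdd(const s: string): integer;
var
  h: PtrUInt;
begin
  h := ToyHash(s);
  repeat
    h := h and (HASHSIZE - 1);
    if Slots[h] = 0 then
      break; // void slot: s is not in the table
    result := Slots[h] - 1;
    if Names[result] = s then
      exit; // found
    inc(h); // hash collision: try next slot
  until false;
  result := length(Names);
  SetLength(Names, result + 1);
  Names[result] := s;
  Slots[h] := result + 1; // store index + 1 so that 0 keeps meaning "void"
end;

begin
  WriteLn(FindOrAdd('Paris')); // 0
  WriteLn(FindOrAdd('Lima'));  // 1
  WriteLn(FindOrAdd('Paris')); // 0
end.
```

Storing `index + 1` in the slot array, as the diff does with `Station[x - 1]`, lets a zero-initialized array double as the "empty" marker without a separate flag.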
@@ -460,7 +468,7 @@ function ByStationName(const A, B): integer;
 result := sa.NameLen - sb.NameLen;
 end;
 
-function Average(sum, count: PtrInt): integer;
+function Average(sum, count: PtrInt): PtrInt;
 // sum and result are temperature * 10 (one fixed decimal)
 var
 x, t: PtrInt; // temperature * 100 (two fixed decimals)

generator/Common/generate.common.pas

Lines changed: 11 additions & 2 deletions
@@ -16,6 +16,7 @@ interface
 { TGenerator }
 TGenerator = class(TObject)
 private
+FOnly400Stations: Boolean;
 rndState: Array [0..1] of Cardinal;
 FInputFile: String;
 FOutPutFile: String;
@@ -27,7 +28,7 @@ TGenerator = class(TObject)
 AFileSize: Int64; ATimeElapsed: TDateTime): String;
 function Rng1brc(Range: Integer): Integer;
 public
-constructor Create(AInputFile, AOutputFile: String; ALineCount: Int64);
+constructor Create(AInputFile, AOutputFile: String; ALineCount: Int64; AOnly400Stations: Boolean = False);
 destructor Destroy; override;
 
 procedure Generate;
@@ -61,11 +62,12 @@ implementation
 
 { TGenerator }
 
-constructor TGenerator.Create(AInputFile, AOutputFile: String; ALineCount: Int64);
+constructor TGenerator.Create(AInputFile, AOutputFile: String; ALineCount: Int64; AOnly400Stations: Boolean);
 begin
 FInputFile := AInputFile;
 FOutPutFile := AOutputFile;
 FLineCount := ALineCount;
+FOnly400Stations := AOnly400Stations;
 
 FStationNames := TStringList.Create;
 FStationNames.Capacity := stationsCapacity;
@@ -196,6 +198,13 @@ procedure TGenerator.Generate;
 progressBatch := floor(FLineCount * (linesPercent / 100));
 start := Now;
 
+if FOnly400Stations then
+begin
+WriteLn('Only 400 weather stations in output file.');
+while FStationNames.Count > 400 do
+FStationNames.Delete(Rng1brc(FStationNames.Count));
+end;
+
 // This is all paweld magic:
 // From here
 // based on code @domasz from lazarus forum, github: PascalVault
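The trimming loop added to `TGenerator.Generate` deletes randomly chosen names until 400 remain. The sketch below reproduces that idea in isolation. It swaps in the RTL `Random()` for the project's own `Rng1brc` (whose implementation is not shown in this diff), and `TrimToCount` and the `Station<N>` names are hypothetical. It assumes, as `TStrings.Delete` requires, that the random index is always in `0 .. Count - 1`.

```pascal
program TrimSketch;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

// delete randomly chosen entries until at most "keep" of them remain
procedure TrimToCount(list: TStrings; keep: integer);
begin
  while list.Count > keep do
    list.Delete(Random(list.Count)); // Random(n) returns 0 .. n - 1
end;

var
  names: TStringList;
  i: integer;
begin
  RandSeed := 42; // fixed seed, so that repeated runs delete the same entries
  names := TStringList.Create;
  try
    for i := 1 to 1000 do
      names.Add('Station' + IntToStr(i));
    TrimToCount(names, 400);
    WriteLn(names.Count); // 400
  finally
    names.Free;
  end;
end.
```

Each deletion shrinks `Count` by one, so the loop always terminates, and a list that already holds 400 names or fewer is left untouched.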

generator/Common/generate.console.pas

Lines changed: 17 additions & 13 deletions
@@ -12,20 +12,22 @@ interface
 ;
 
 const
-cShortOptHelp: Char = 'h';
-cLongOptHelp = 'help';
-cShortOptVersion: Char = 'v';
-cLongOptVersion = 'version';
-cShortOptInput: Char = 'i';
-cLongOptInput = 'input-file';
-cShortOptOutput: Char = 'o';
-cLongOptOutput = 'output-file';
-cShortOptNumber: Char = 'n';
-cLongOptNumber = 'line-count';
+cShortOptHelp: Char = 'h';
+cLongOptHelp = 'help';
+cShortOptVersion: Char = 'v';
+cLongOptVersion = 'version';
+cShortOptInput: Char = 'i';
+cLongOptInput = 'input-file';
+cShortOptOutput: Char = 'o';
+cLongOptOutput = 'output-file';
+cShortOptNumber: Char = 'n';
+cLongOptNumber = 'line-count';
+cShortOptStations: Char = '4';
+cLongOptStations = '400stations';
 {$IFNDEF FPC}
-cShortOptions: array of char = ['h', 'v', 'i', 'o', 'n'];
+cShortOptions: array of char = ['h', 'v', 'i', 'o', 'n', '4'];
 cLongOptions: array of string = ['help', 'version', 'input-file', 'output-file',
-'line-count'];
+'line-count', '400stations'];
 {$ENDIF}
 
 resourcestring
@@ -44,7 +46,8 @@ interface
 var
 inputFilename: String = '';
 outputFilename: String = '';
-lineCount: Integer = 0;
+lineCount: Integer = 0;
+only400Stations: Boolean = False;
 
 procedure WriteHelp;
 
@@ -64,6 +67,7 @@ procedure WriteHelp;
 WriteLn(' -i|--input-file <filename> The file containing the Weather Stations');
 WriteLn(' -o|--output-file <filename> The file that will contain the generated lines');
 WriteLn(' -n|--line-count <number> The amount of lines to be generated ( Can use 1_000_000_000 )');
+WriteLn(' -4|--400stations Only 400 weather stations in output file');
 end;
 
 end.

generator/Delphi/src/generator.dpr

Lines changed: 9 additions & 2 deletions
@@ -42,7 +42,7 @@ begin
 WriteLn(Format(rsLineCount, [Double(lineCount)]));
 WriteLn;
 
-FGenerator := TGenerator.Create(inputFilename, outputFilename, lineCount);
+FGenerator := TGenerator.Create(inputFilename, outputFilename, lineCount, only400Stations);
 try
 try
 FGenerator.generate;
@@ -146,8 +146,10 @@ begin
 
 if (Length(FParams[I]) = 1) or (FParams[I][2] = '=') then
 ParamOK := CheckShortParams(FParams[I][1])
+else if Pos('=', FParams[I]) > 0 then
+ParamOK := CheckLongParams(Copy(FParams[I], 1, Pos('=', FParams[I]) - 1))
 else
-ParamOK := CheckLongParams(Copy(FParams[I], 1, Pos('=', FParams[I]) - 1));
+ParamOK := CheckLongParams(FParams[I]);
 
 // if we found a bad parameter, don't need to check the rest of them
 if not ParamOK then
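The hunk above adds a branch for long parameters that carry no `=value` part, such as the new `400stations` flag. The reason can be seen in a short sketch: when `Pos` finds no `=`, it returns 0, and `Copy(s, 1, -1)` yields an empty string, so the previous expression would have passed an empty name to `CheckLongParams`. The helper name `OptionName` and the sample strings below are hypothetical; how `FParams` entries are actually spelled (with or without leading dashes) is not visible in this diff.

```pascal
program OptionNameSketch;
{$mode objfpc}{$H+}

// return the part of a command line parameter that precedes '=' (if any)
function OptionName(const param: string): string;
var
  p: integer;
begin
  p := Pos('=', param);
  if p > 0 then
    result := Copy(param, 1, p - 1) // "name=value": keep only the name
  else
    result := param;                // plain flag: keep it whole
end;

begin
  WriteLn(OptionName('line-count=1000')); // line-count
  WriteLn(OptionName('400stations'));     // 400stations
  // the pre-fix expression applied to a flag without '=' gives an empty string
  WriteLn('[', Copy('400stations', 1, Pos('=', '400stations') - 1), ']'); // []
end.
```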
@@ -238,6 +240,11 @@ begin
 inc(valid);
 end;
 
+only400Stations := (FParams.IndexOf(cShortOptStations) >= 0) or (FParams.IndexOf(cLongOptStations) >= 0);
+
+writeln(only400Stations);
+writeln(fparams.Text);
+
 // check if everything was provided
 Result := (valid = 3) and (invalid = 0);
 end;

generator/Lazarus/src/generator.lpr

Lines changed: 8 additions & 4 deletions
@@ -37,19 +37,21 @@ procedure TOneBRCGenerator.DoRun;
 tmpLineCount: String;
 begin
 // quick check parameters
-ErrorMsg:= CheckOptions(Format('%s%s%s:%s:%s:',[
+ErrorMsg:= CheckOptions(Format('%s%s%s:%s:%s:%s',[
 cShortOptHelp,
 cShortOptVersion,
 cShortOptInput,
 cShortOptOutput,
-cShortOptNumber
+cShortOptNumber,
+cShortOptStations
 ]),
 [
 cLongOptHelp,
 cLongOptVersion,
 cLongOptInput+':',
 cLongOptOutput+':',
-cLongOptNumber+':'
+cLongOptNumber+':',
+cLongOptStations+':'
 ]
 );
 if ErrorMsg<>'' then
@@ -130,6 +132,8 @@ procedure TOneBRCGenerator.DoRun;
 Exit;
 end;
 
+only400stations := HasOption(cShortOptStations, cLongOptStations);
+
 inputFilename:= ExpandFileName(inputFilename);
 outputFilename:= ExpandFileName(outputFilename);
 
@@ -138,7 +142,7 @@ procedure TOneBRCGenerator.DoRun;
 WriteLn(Format(rsLineCount, [ Double(lineCount) ]));
 WriteLn;
 
-FGenerator:= TGenerator.Create(inputFilename, outputFilename, lineCount);
+FGenerator:= TGenerator.Create(inputFilename, outputFilename, lineCount, only400stations);
 try
 try
 FGenerator.Generate;
