thecppzoo
diff --git a/‎benchmark/atoi-corpus.h‎
Lines changed: 0 additions & 3 deletions b/‎benchmark/atoi-corpus.h‎
diff --git a/‎benchmark/atoi.h‎
Lines changed: 1 addition & 0 deletions b/‎benchmark/atoi.h‎
diff --git a/‎benchmark/catch2swar-demo.cpp‎
Lines changed: 9 additions & 7 deletions b/‎benchmark/catch2swar-demo.cpp‎
diff --git a/‎pokerbotic/README.md‎
Lines changed: 66 additions & 0 deletions b/‎pokerbotic/README.md‎
diff --git a/‎pokerbotic/SWAR CPPCon 2019 - reduced.key‎
7.35 MB b/‎pokerbotic/SWAR CPPCon 2019 - reduced.key‎
diff --git a/‎pokerbotic/design/Fastest-Floyd-Sampling.md‎
Lines changed: 35 additions & 0 deletions b/‎pokerbotic/design/Fastest-Floyd-Sampling.md‎
diff --git a/‎pokerbotic/design/Hand-Ranking.md‎
Lines changed: 63 additions & 0 deletions b/‎pokerbotic/design/Hand-Ranking.md‎
@@ -120,23 +120,20 @@ struct CorpusStringLength {
     }
 };
 
-
 #if ZOO_CONFIGURED_TO_USE_AVX()
 #define AVX2_STRLEN_CORPUS_X_LIST \
     X(ZOO_AVX, zoo::avx2_strlen)
 #else
 #define AVX2_STRLEN_CORPUS_X_LIST /* nothing */
 #endif
 
-
 #define STRLEN_CORPUS_X_LIST \
     X(LIBC_STRLEN, strlen) \
     X(ZOO_STRLEN, zoo::c_strLength) \
     X(ZOO_NATURAL_STRLEN, zoo::c_strLength_natural) \
     X(GENERIC_GLIBC_STRLEN, STRLEN_old) \
     AVX2_STRLEN_CORPUS_X_LIST
 
-
 #define X(Typename, FunctionToCall) \
     struct Invoke##Typename { int operator()(const char *p) { return FunctionToCall(p); } };
 
 
@@ -5,6 +5,7 @@
 
 uint32_t parse_eight_digits_swar(const char *chars);
 uint32_t lemire_as_zoo_swar(const char *chars);
+
 std::size_t spaces_glibc(const char *ptr);
 
 namespace zoo {
 
@@ -26,10 +26,12 @@ TEST_CASE("Atoi benchmarks", "[atoi][swar]") {
         auto zl2 = zoo::c_strLength(skipFst);
         auto strlen2 = strlen(skipFst);
         REQUIRE(zl2 == strlen2);
-        auto avx1 = zoo::avx2_strlen(TwoStrings);
-        REQUIRE(avx1 == strlen1);
-        auto avx2 = zoo::avx2_strlen(skipFst);
-        REQUIRE(avx2 == strlen2);
+        #if ZOO_CONFIGURED_TO_USE_AVX()
+            auto avx1 = zoo::avx2_strlen(TwoStrings);
+            REQUIRE(avx1 == strlen1);
+            auto avx2 = zoo::avx2_strlen(skipFst);
+            REQUIRE(avx2 == strlen2);
+        #endif
     }
     auto corpus8D = Corpus8DecimalDigits::makeCorpus(g);
     auto corpusStrlen = CorpusStringLength::makeCorpus(g);
@@ -49,9 +51,9 @@ TEST_CASE("Atoi benchmarks", "[atoi][swar]") {
     REQUIRE(fromZOO_STRLEN == fromLIBC_STRLEN);
     REQUIRE(fromLIBC_STRLEN == fromZOO_NATURAL_STRLEN);
     REQUIRE(fromGENERIC_GLIBC_STRLEN == fromZOO_NATURAL_STRLEN);
-#if ZOO_CONFIGURED_TO_USE_AVX()
-    REQUIRE(fromZOO_AVX == fromZOO_STRLEN);
-#endif
+    #if ZOO_CONFIGURED_TO_USE_AVX()
+        REQUIRE(fromZOO_AVX == fromZOO_STRLEN);
+    #endif
 
     auto haveTheRoleOfMemoryBarrier = -1;
     #define X(Type, Fun) \
 
@@ -0,0 +1,66 @@
+# Pokerbotic
+
+## This repository will be changing very soon to incorporate feedback from my CPPCon 2019 presentation; please check back in a couple of weeks
+
+**Pokerbotic** is a poker engine.  It has been developed little by little by a professional software engineer and a semi-professional poker player with professional knowledge of stochastic processes.
+
+Currently, we have the hand evaluator framework, which achieves a rate of 100 million evaluations per second on commonly available machines; that is, it classifies more than 100 million poker hands per second into categories such as "four of a kind".
+
+**The code today assumes the AMD64 architecture** and support for the [BMI2 instructions](https://en.wikipedia.org/wiki/Bit_Manipulation_Instruction_Sets#BMI2_.28Bit_Manipulation_Instruction_Set_2.29).  AMD64/Intel is not essential to this code; the necessary adaptations for other architectures simply have not been made.  You are welcome to help with this.
+
+Currently, the code is a header-only framework with some use cases programmed in C++ 14.
+
+This code beats other poker engines, including the popular open source framework "PokerStove", in both ease of use and performance, due to the application of Generic Programming.
+
+Generic Programming allows hoisting what would otherwise be run-time computation to compilation time; this is illustrated by the non-trivial `static_assert` in the code itself.
+
+The documentation for the advanced programming techniques, including the Floyd sampling algorithm and the SWAR techniques, is being written.
+
+## How to build it
+
+### Prerequisites:
+
+1. GCC-compatible compiler.  We recommend Clang 3.9 or 4.0 specifically.  Benchmarks indicate Clang gives noticeably faster code.  The code uses GCC extensions in the form of builtins.
+2. C++ 14.  In GCC or Clang, do not forget the option `-std=c++14`
+3. Support for BMI2 instructions, activated with `-march=native` (preferred way) or specifically with `-mbmi2`
+4. Test cases require the ["Catch" testing framework](https://github.com/philsquared/Catch).
+5. Currently the code does not require a Unix/POSIX operating system (it should be compilable on 64-bit Windows with either GCC or Clang); however, **we reserve the option to make the code incompatible with any operating system**.
+
+### There are several test programs available:
+
+#### Unit tests at [src/main.cpp](https://github.com/thecppzoo/pokerbotic/blob/master/src/main.cpp)
+
+Several unit tests.  This program illustrates how to use the engine framework.  To build it, at the project root, you may do this:
+
+`clang++ -std=c++14 -Iinc -DTESTS -O3 -march=native -I../Catch/include src/main.cpp -o main`
+
+Notice you have to define TESTS and indicate the path to the "Catch" testing framework.
+
+#### [src/benchmarks.cpp](https://github.com/thecppzoo/pokerbotic/blob/master/src/benchmarks.cpp)
+
+A program that measures the execution speed of several internal mechanisms.  To build it, at the project root, you may do this:
+
+`clang++ -std=c++14 -Iinc -DBENCHMARKS -O3 -march=native src/benchmarks.cpp -o benchmarks`
+
+This program can be run without arguments. It will generate all 7-card hands and time the execution of all evaluations.
+
+#### [src/comparisonBenchmark.cpp](https://github.com/thecppzoo/pokerbotic/blob/master/src/comparisonBenchmark.cpp)
+
+This program generates, as in Texas Hold'em Poker, all possible sets of 5 community cards, and proceeds to iterate over all two-player 2-card "pocket cards".
+
+Because of the size of this search space, this program emits a current tally of execution every 100 million cases.
+
+To build, for example:
+
+`clang++ -std=c++14 -Iinc -DHAND_COMPARISON -O3 -march=native -o cb src/comparisonBenchmark.cpp`
+
+It can be run without arguments.
+
+## Next feature to be implemented
+
+Currently, multithreaded partitioning of evaluations is being implemented.
+
+## Documentation/User manual
+
+Not yet written.  Most of the code available under the folder `ep/` is fully operational.
+
@@ -0,0 +1,35 @@
+The Floyd sampling algorithm --you can see an excellent exposition [here](http://www.nowherenearithaca.com/2013/05/robert-floyds-tiny-and-beautiful.html)-- is very convenient for use cases such as getting a hand of cards from a deck.
+
+The fastest way to represent sets of finite and small domains (such as a deck of cards) seems to be as bits in bitfields.
+
+Taking a deck of 52 cards as an example, we may want, for instance, to generate all of the 7-card hands.  I wrote a straightforward implementation [here](https://github.com/thecppzoo/pokerbotic/blob/master/inc/ep/Floyd.h).  Its interface is this:
+
+```c++
+template<int N, int K, typename Rng>
+inline uint64_t floydSample(Rng &&g);
+```
+
+With speed in mind, the sizes of the set and the subset are template parameters.  Compilers such as Clang and GCC routinely generate optimal code for the given sizes, as can be seen in the compiler explorer, which means they take advantage of those sizes being template parameters.  The return value is the subset, expressed as the bits set among the least significant N bits of the resulting integer.
+
+However, what if the use case is to generate a sample (subset) of the *remaining* members of the set?  For example, to generate a random 2-card hand *after* five cards have been selected?
+
+That has been implemented too, in a function with this signature:
+
+```c++
+template<int N, int K, typename Rng>
+inline uint64_t floydSample(Rng &&g, uint64_t preselected);
+```
+
+Here, `preselected` represents the cards already selected.  If what is desired is to get two cards from the cards remaining after selecting `fiveAlreadySelected` cards, the call `ep::floydSample<47, 2>(randomGenerator, fiveAlreadySelected)` will suffice.  Notice the template argument for `N` is now 47, reflecting the fact that the remaining set of cards has 47 cards.  Unfortunately, it is difficult to guarantee at compilation time that the argument `fiveAlreadySelected` indeed has exactly five elements, because operations such as intersection or union result in sets with cardinalities that are fundamentally run-time values.
+
+This overload for `ep::floydSample` requires calling a "deposit" operation.  This is an interesting operation that is hard to implement without direct support from the processor:  Given a mask, the bits of the input will be "deposited" one at a time into the bit positions indicated as bits set in the mask.  In the AMD64/Intel architecture EM64T this is supported in the instruction set "BMI2" as the instruction [`PDEP`](https://chessprogramming.wikispaces.com/BMI2).  The implementation of the adaptation of the Floyd algorithm for a known number of preselected elements is then straightforward: subtract the number of preselected bits from the total, call the normal floydSample, and "deposit" the result into the inverse of the preselection.
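The steps just described can be sketched in portable C++.  This is only an illustration under stated assumptions, not the code in `inc/ep/Floyd.h`: the names `depositBits` and `floydSampleSketch` are hypothetical, `depositBits` is a software loop standing in for the `PDEP` instruction, and `std::uniform_int_distribution` is used for the random draws.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Hypothetical software stand-in for PDEP: the low bits of `source` are
// placed, in order, into the positions where `mask` has a bit set.
inline uint64_t depositBits(uint64_t source, uint64_t mask) {
    uint64_t result = 0;
    for (uint64_t bit = 1; mask != 0; bit <<= 1) {
        uint64_t lowest = mask & (~mask + 1); // lowest set bit of the mask
        if (source & bit) { result |= lowest; }
        mask ^= lowest;                       // consume that mask position
    }
    return result;
}

// Floyd sampling: K distinct elements out of N, returned as a bit set
// within the least significant N bits.
template<int N, int K, typename Rng>
uint64_t floydSampleSketch(Rng &&g) {
    static_assert(0 <= K && K <= N && N <= 64, "sizes must fit in 64 bits");
    uint64_t chosen = 0;
    for (int j = N - K; j < N; ++j) {
        std::uniform_int_distribution<int> pick(0, j);
        uint64_t candidate = uint64_t(1) << pick(g);
        // If the candidate was already chosen, take element j instead;
        // j cannot have been chosen in an earlier iteration.
        chosen |= (chosen & candidate) ? (uint64_t(1) << j) : candidate;
    }
    return chosen;
}

// Sampling from the remaining elements: N is the number of elements left,
// and the compact sample is deposited into the inverse of the preselection.
template<int N, int K, typename Rng>
uint64_t floydSampleSketch(Rng &&g, uint64_t preselected) {
    uint64_t compact = floydSampleSketch<N, K>(g);
    uint64_t result = depositBits(compact, ~preselected);
    assert((result & preselected) == 0); // never overlaps the preselection
    return result;
}
```

With five cards preselected out of 52, `floydSampleSketch<47, 2>(generator, fiveAlreadySelected)` would return two bits that land only on positions not already taken.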
+
+What are the costs of these implementations?
+
+1. The programmer needs to indicate at compilation time the number of elements in the set.  If this number is a runtime value, a `switch` will be needed to convert runtime numbers to compile-time numbers, which translates into an indexed jump at the assembler level.
+2. All of the operations in the normal Floyd sampling algorithm are negligible in terms of execution costs compared to calling the random number generator, which is essential in each iteration.
+3. The adaptation to account for preselections only requires two more assembler instructions: inverting the preselection and depositing the sample into it.  `PDEP` has been measured to have a throughput of one per clock, which is excellent compared to implementing it in software; however, in current processors it can only be executed in a particular pipeline.  In Pokerbotic we don't think we are oversubscribing this pipeline, so we suspect we get a 1-per-clock throughput for this use case.
+4. However, the adaptation to account for preselections also requires the programmer to accurately indicate the cardinality of the preselection.  This can add the same cost as number 1 here, plus the population count, another single-pipeline, 1-per-clock throughput instruction.
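The `switch` mentioned in the first cost can be sketched as follows.  The names `lowestBits` and `lowestBitsRuntime` are hypothetical; `lowestBits<N>` is only a stand-in for any function that, like `floydSample<N, K>`, takes the set size as a template parameter, and the range of handled sizes is an arbitrary choice for the example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a function whose set size is a template parameter.
template<int N>
constexpr uint64_t lowestBits() {
    static_assert(0 < N && N < 64, "N must fit in a 64-bit word");
    return (uint64_t(1) << N) - 1;
}

// Maps a run-time count onto the matching compile-time instantiation.
// Consecutive case labels are what allows a compiler to emit an indexed jump.
inline uint64_t lowestBitsRuntime(int remaining) {
    assert(45 <= remaining && remaining <= 48);
    switch (remaining) {
        case 45: return lowestBits<45>();
        case 46: return lowestBits<46>();
        case 47: return lowestBits<47>();
        case 48: return lowestBits<48>();
        default: return 0; // not reached when the precondition holds
    }
}
```

Each case instantiates the template once, so the cost at run time is the dispatch itself; whether the compiler actually produces a jump table depends on the compiler and the number of cases.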
+
+We are interested in any way to implement a faster subset sample selection.  This use case is at the heart of many operations in Pokerbotic.
+
@@ -0,0 +1,63 @@
+# Design of the hand classification mechanism in Pokerbotic
+
+## Detection of N-of-a-kind
+
+Detection of N-of-a-kind, two pairs and full house is described [here](https://github.com/thecppzoo/pokerbotic/blob/master/design/What-is-noak-or-how-to-determine-pairs.md).
+
+## Detection of flush
+
+Flush detection happens at [Poker.h:78](https://github.com/thecppzoo/pokerbotic/blob/master/inc/ep/Poker.h#L78): the hand is filtered for each suit, and the builtin for population count is called on the filtered set.  This code assumes a hand can only have a flush in one suit (in other words, that the hand has up to 9 cards).
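A minimal sketch of this filtering, under an assumed layout that may differ from the one in `Poker.h`: each rank occupies 4 consecutive bits, one per suit, so a single suit is selected by a mask with one bit set in every group of four.  The names `kOneSuit` and `flushSuitCards` are hypothetical, and `__builtin_popcountll` is the GCC/Clang builtin.

```cpp
#include <cstdint>

// Assumed layout: bit (4 * rank + suit), 13 ranks, 4 suits.
// This constant has the lowest bit set in each of the 13 groups of four.
constexpr uint64_t kOneSuit = 0x1111111111111ull;

// Returns the cards of the suit that makes a flush, or 0 if there is none.
// As in the text, at most one suit is assumed to be able to hold a flush.
inline uint64_t flushSuitCards(uint64_t hand) {
    for (int suit = 0; suit < 4; ++suit) {
        uint64_t sameSuit = hand & (kOneSuit << suit);
        if (__builtin_popcountll(sameSuit) >= 5) { return sameSuit; }
    }
    return 0;
}
```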
+
+## Detection of straights
+
+The straightforward way to detect straights, if the ranks were consecutive bits (which we will call "packed representation"), is this:
+
+```c++
+unsigned straights(unsigned cards) {
+    auto shifted1 = cards << 1;
+    auto shifted2 = cards << 2;
+    auto shifted3 = cards << 3;
+    auto shifted4 = cards << 4;
+    return cards & shifted1 & shifted2 & shifted3 & shifted4;
+}
+```
+
+By shifting and doing a conjunction at the end, the only bits set to one in the result are those that complete a run of five consecutive bits set to one.  Before accounting for the aces to work as "ace or one", there are possible improvements to be discussed:
+
+### Checking for the presence of 5 or 10
+
+In a deck of 13 ranks starting with 2, all straights must have either the rank 5 or the rank ten.  This has a probability of nearly a third; however, testing for this explicitly is a performance disadvantage.  It seems the branch is fundamentally not predictable by the processor, so the misprediction penalty outweighs the benefit of the early exit.  The code above has 8 binary operators and 4 compile-time constants, so there is little budget for branch misprediction.  Older versions of the code had this check until benchmarks showed it to be a disadvantage.
+
+### Checking for partial conjunctions
+
+For the same reason, testing whether any of the conjunctions is zero in order to return 0 early does not improve performance, as benchmarking confirmed.
+
+### Addition chain
+
+There is one improvement that benchmarks confirm:
+
+```c++
+unsigned straights(unsigned cards) {
+    // assume the point of view from the bit position for tens.
+    auto shift1 = cards >> 1;
+    // in shift1 the bit for the rank ten now contains the bit for jacks
+    auto tj = cards & shift1;
+    auto shift2 = tj >> 2;
+    // in shift2, the position for the rank ten now contains the conjunction of originals queen and king
+    auto tjqk = tj & shift2;
+    return tjqk & (cards >> 4);
+}
+```
+
+This implementation (which does not take into account the ace duality) requires 6 binary operations and 3 constants and accomplishes the same thing as the straightforward implementation.  Benchmarks confirm it takes roughly 3/4 of the time of the straightforward implementation.
+
+The key insight here is to view the detection of the straight as adding up to 5 starting with 1.  The straightforward implementation does the equivalent of `1 + 1 + 1 + 1 + 1`; this new implementation does `auto two = 1 + 1; return two + two + 1`.  This technique is called building an *addition chain*.  It was inspired by the second chapter of the book ["From Mathematics To Generic Programming"](https://www.amazon.com/Mathematics-Generic-Programming-Alexander-Stepanov/dp/0321942043).
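As a sanity check of the two listings, the sketch below compares them exhaustively over every 13-bit set of ranks in the packed representation.  The function names are hypothetical.  One detail worth noting: because one version shifts left and the other right, the straightforward version marks the highest bit of each run of five and the addition-chain version marks the lowest, so the results are compared after shifting by 4.

```cpp
// Straightforward version: four shifts and four conjunctions.
inline unsigned straightsStraightforward(unsigned cards) {
    return cards & (cards << 1) & (cards << 2) & (cards << 3) & (cards << 4);
}

// Addition-chain version: three shifts and three conjunctions.
inline unsigned straightsChain(unsigned cards) {
    auto pairs = cards & (cards >> 1); // two consecutive ranks present
    auto quads = pairs & (pairs >> 2); // four consecutive ranks present
    return quads & (cards >> 4);       // plus the fifth one
}

// Exhaustive comparison over all 2^13 rank sets.
inline bool straightImplementationsAgree() {
    for (unsigned cards = 0; cards < (1u << 13); ++cards) {
        if (straightsStraightforward(cards) != (straightsChain(cards) << 4)) {
            return false;
        }
    }
    return true;
}
```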
+
+Taking into account the dual rank of aces simply means turning on the 'ones' bit if the ace is present, but this requires a left shift to make room for it.  This can be done at the beginning of the straight check, and its cost can be amortized by the compiler doing a conditional move early, meaning the result will be ready by the time it is used.
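A minimal sketch of the ace handling in the packed representation, assuming bit 0 is the rank two and bit 12 is the ace (the source does not spell out this numbering).  The name `straightsWithLowAce` is hypothetical, and the sketch copies the ace bit with a shift and a mask rather than the conditional move mentioned above.

```cpp
// Assumed layout of `ranks`: bit 0 = two, ..., bit 8 = ten, ..., bit 12 = ace.
inline unsigned straightsWithLowAce(unsigned ranks) {
    // Make room at bit 0 and copy the ace there, so that after the shift
    // bit 0 = ace as "one", bit 1 = two, ..., bit 13 = ace as high card.
    unsigned withLowAce = (ranks << 1) | ((ranks >> 12) & 1u);
    auto pairs = withLowAce & (withLowAce >> 1);
    auto quads = pairs & (pairs >> 2);
    // Bit i of the result is set when a straight starts at position i:
    // bit 0 is ace-to-five, bit 9 is ten-to-ace.
    return quads & (withLowAce >> 4);
}
```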
+
+There is one further complication in the code: the engine uses the rank-array representation.  Provided that the shifts are by 4, 8, 12, and 16 bits instead of 1, 2, 3, and 4, there isn't yet a difference.  There are two needs for straights:
+
+1. Normal straights, in which the suit of the rank does not matter.  This is accomplished by making the 13 rank counts as described in how to detect pairs, etc., and using the SWAR operation `greaterEqual<1>` prior to the straight code.  Naturally, the straights don't incur an extra cost for the popcounts, because that cost is amortized in the necessary detection of pairs, three of a kind, etc.  The `greaterEqual<N>(arg)` operation requires two constants and two or three assembler operations, depending on how the result is used; thus, for practical purposes, it has negligible cost compared to a packed rank representation.
+2. Straights to detect straight flush:  Since the bits for
+
+We suspect our straight detection code is optimal in terms of performance.