Signed-digit based ecmult_const algorithm #1184

sipa · 2022-12-30T06:39:21Z

Using some insights learned from #1058, this replaces the fixed-wnaf ecmult_const algorithm with a signed-digit based one. Conceptually both algorithms are very similar, in that they boil down to summing precomputed odd multiples of the input points. Practically however, the new algorithm is simpler because it's just using scalar operations, rather than relying on wnaf machinery with skew terms to guarantee odd multipliers.

The idea is that we can compute $q \cdot A$ as follows:

Let $s = f(q)$, for some function $f()$.
Compute $(s_1, s_2)$ such that $s = s_1 + \lambda s_2$, using secp256k1_scalar_lambda_split.
Let $v_1 = s_1 + 2^{128}$ and $v_2 = s_2 + 2^{128}$ (such that the $v_i$ are positive and $n$ bits long).
Compute the result as $$\sum_{i=0}^{n-1} (2v_1[i]-1) 2^i A + \sum_{i=0}^{n-1} (2v_2[i]-1) 2^i \lambda A$$ where $x[i]$ stands for the $i$-th bit of $x$, so summing positive and negative powers of two times $A$ and $\lambda A$, based on the bits of $v_1$ and $v_2$ respectively.

The comments in ecmult_const_impl.h show that if $f(q) = (q + (1+\lambda)(2^n - 2^{129} - 1))/2$, taken modulo the group order (not the bit length $n$ that appears in the exponent), the result will equal $q \cdot A$.
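For reference, a short derivation of why this choice of $f$ works, using only the definitions above. The symbol $N$ for the group order is introduced here only to keep it apart from the bit length $n$; it is not notation from the PR.

For any $n$-bit value $v$, the signed-digit sum collapses to

$$\sum_{i=0}^{n-1} (2v[i]-1) 2^i = 2v - (2^n - 1).$$

Substituting $v_1 = s_1 + 2^{128}$ and $v_2 = s_2 + 2^{128}$, and using $s = s_1 + \lambda s_2$:

$$(2v_1 - 2^n + 1) A + (2v_2 - 2^n + 1) \lambda A = \left(2s - (1+\lambda)(2^n - 2^{129} - 1)\right) A.$$

Requiring this to equal $q \cdot A$ gives $2s \equiv q + (1+\lambda)(2^n - 2^{129} - 1) \pmod N$, which is the stated $f(q)$ after dividing by 2 (possible because $N$ is odd).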

This last step can be performed in groups of multiple bits at once, by looking up entries in a precomputed table of odd multiples of $A$ and $\lambda A$, and then multiplying by a power of two before proceeding to the next group.
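A minimal sketch of the grouping step, under loud assumptions: plain integers stand in for curve points (so "$k \cdot A$" is ordinary multiplication), the function names are hypothetical, and the lookup branches on the bits, which constant-time code would avoid. It is not the code in ecmult_const_impl.h; it only illustrates that a group of $w$ bits always maps to an odd multiple between $-(2^w-1)$ and $2^w-1$.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for a curve point: plain integers, so "k * A" is ordinary
 * multiplication and the results can be checked directly. */

/* Bit-by-bit form: sum over i of (2*v[i] - 1) * 2^i * a, for an n-bit v. */
static int64_t sd_bitwise(uint32_t v, unsigned n, int64_t a) {
    int64_t acc = 0;
    unsigned i;
    assert(n >= 1 && n <= 24);
    for (i = 0; i < n; i++) {
        int64_t digit = 2 * (int64_t)((v >> i) & 1) - 1; /* +1 or -1 */
        acc += digit * ((int64_t)1 << i) * a;
    }
    return acc;
}

/* Grouped form: process w bits at a time, top group first, using a table of
 * the odd multiples a, 3a, ..., (2^w - 1)a. Requires w to divide n. */
static int64_t sd_windowed(uint32_t v, unsigned n, unsigned w, int64_t a) {
    int64_t table[8]; /* table[k] = (2k+1) * a, enough for w <= 4 */
    int64_t acc = 0;
    uint32_t half, mask, k;
    unsigned g;
    assert(w >= 1 && w <= 4 && n >= w && n <= 24 && n % w == 0);
    half = (uint32_t)1 << (w - 1);
    mask = ((uint32_t)1 << w) - 1;
    for (k = 0; k < half; k++) table[k] = (int64_t)(2 * k + 1) * a;
    for (g = n / w; g-- > 0; ) {
        uint32_t bits = (v >> (g * w)) & mask;
        int64_t term;
        /* The group's signed-digit value is 2*bits - (2^w - 1), always odd. */
        if (bits & half) {
            term = table[bits - half];      /* positive odd multiple */
        } else {
            term = -table[half - 1 - bits]; /* negative odd multiple */
        }
        acc = acc * ((int64_t)1 << w) + term; /* w "doublings", then add */
    }
    return acc;
}
```

For example, `sd_bitwise(11, 4, 5)` and `sd_windowed(11, 4, 2, 5)` both return 35, which is $(2 \cdot 11 - 15) \cdot 5$.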

The result is slightly faster (I measure ~2% speedup), but significantly simpler as it only uses scalar arithmetic to determine the table lookup values. The speedup comes from needing no skew corrections at the end and from less overhead in determining table indices. The precomputed table sizes are also made independent from the ecmult ones, after observing that the optimal table size is bigger here (which also gives a small speedup).

real-or-random

Whoa nice! The skew correction is really annoying to reason about...

I just don't know where to get all the review power from.

src/tests.c

sipa · 2022-12-30T22:12:03Z

Added a commit to remove secp256k1_scalar_shr_int, which is now unused apart from tests. It's a net reduction diff now, even with all the comments it adds.

sipa · 2023-01-05T03:30:37Z

@peterdettman Randomly tagging you, you may find this interesting.

apoelstra · 2023-01-07T14:05:05Z

Very cool! I wonder if the explanation might be clearer if you said something like "without any further optimizations, k0 would be 2^n - 1 with n = 256; however we are going to combine this scheme with the GLV endomorphism and do a windowed rather than bitwise approach. So instead we solve for k0" rather than "for some constant k0".

In the end I was able to understand what you were doing without much trouble, so what you've written is fine, but it was a bit intimidating to see k0, k1, k2 all introduced at once just as unknowns.

apoelstra

ACK f16c500

sipa · 2023-01-07T18:50:34Z

@apoelstra I've changed the explanation to introduce the offset terms a bit more gently.

I've also dropped the ecmult_const_get_scalar_bit_group function as secp256k1_scalar_get_bits_var can be used instead (the _var part is not an issue as it's only variable-time in the offset/length, not the scalar).
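To illustrate the point about `_var`, here is a toy bit-group extractor. The type `toy_scalar` and the function `toy_get_bits` are hypothetical and are not the library's implementation; the sketch only shows that the single branch depends on the offset and length (public loop values), not on the limb contents.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy scalar: 256 bits as eight 32-bit limbs, least significant first. */
typedef struct { uint32_t d[8]; } toy_scalar;

/* Return `count` bits (1..32) starting at bit `offset`. The branch below looks
 * only at offset and count, never at the scalar's value. */
static uint32_t toy_get_bits(const toy_scalar *a, unsigned offset, unsigned count) {
    unsigned limb = offset >> 5, shift = offset & 31;
    uint32_t mask, r;
    assert(count >= 1 && count <= 32 && offset + count <= 256);
    mask = count == 32 ? 0xFFFFFFFFu : (((uint32_t)1 << count) - 1);
    r = a->d[limb] >> shift;
    if (shift + count > 32) {
        /* The group straddles two limbs; pull the high part from the next one. */
        r |= a->d[limb + 1] << (32 - shift);
    }
    return r & mask;
}
```

With limbs `{0x89ABCDEF, 0x01234567, 0, ...}`, `toy_get_bits(&s, 28, 8)` returns `0x78`: the top nibble of the first limb combined with the bottom nibble of the second.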

apoelstra · 2023-01-08T15:16:23Z

@sipa this looks way better, thanks!

apoelstra

ACK b4ff4eb

sipa · 2023-01-19T20:37:01Z

Rebased after merge of #1170 and #1190.

sipa · 2023-01-20T22:09:26Z

Mental note: instead of adding $2^{128}$ to the split scalars, I believe it's also possible to (a) conditionally negate the scalar (if it's negative) and then (b) swap the meaning of positive/negative in the table (alternatively, bitwise negate the integers read from the table). This would mean we only need enough table coverage for 128 bits rather than 129, which for window=4 means one fewer addition.

EDIT: no, doesn't work

sipa · 2023-01-23T16:16:13Z

Oops, my previous rebase accidentally reverted the changes I made to address @apoelstra's comment. Re-rebased.

src/ecmult_const_impl.h

peterdettman · 2023-02-11T05:35:37Z

> Mental note: instead of adding $2^{128}$ to the split scalars, I believe it's also possible to (a) conditionally negate the scalar (if it's negative) and then (b) swap the meaning of positive/negative in the table (alternatively, bitwise negate the integers read from the table). This would mean we only need enough table coverage for 128 bits rather than 129, which for window=4 means one fewer addition.
>
> EDIT: no, doesn't work

Are you able to explain why that doesn't work?

peterdettman · 2023-02-11T08:06:41Z

Nice idea, simple and clear.

I would prefer to use 2^128 as the offset, allowing the split to use the full two's complement range of [-2^128, 2^128). That generalizes better to other split schemes (I have in mind the basis reduction of https://ia.cr/2020/454).
The Straus ladder has the same need to deal with negative split outputs; should the mechanism be unified (one way or the other?)
I don't have a build to test the performance of the current _scalar_split_lambda, but it's doing quite a bit of unnecessary work compared to the "original" non-modular formula, which it seems could be done cmod 2^129 (i.e. 129-bit two's complement representation) - and e.g. avoid calculating the 384 bits that are shifted away. Not to belabor the point, but if the split outputs were in 129-bit two's complement, adding a 2^128 offset is just a bit flip (or maybe a single limb complement).
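The "just a bit flip" remark can be checked on a small scale. The sketch below uses a 9-bit two's complement encoding as a stand-in for the 129-bit one; `TOY_BITS`, `toy_encode`, and `toy_offset_by_flip` are hypothetical names chosen for illustration, not anything in the library.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_BITS 9 /* stand-in for 129 */

/* Encode s in [-2^(TOY_BITS-1), 2^(TOY_BITS-1)) as TOY_BITS-bit two's complement. */
static uint32_t toy_encode(int32_t s) {
    assert(s >= -(1 << (TOY_BITS - 1)) && s < (1 << (TOY_BITS - 1)));
    return (uint32_t)s & ((1u << TOY_BITS) - 1);
}

/* Offset value v = s + 2^(TOY_BITS-1), obtained by flipping the top bit only. */
static uint32_t toy_offset_by_flip(uint32_t enc) {
    return enc ^ (1u << (TOY_BITS - 1));
}
```

For instance, $s = -3$ encodes as 509, and flipping the top bit gives 253, which is $-3 + 2^8$.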

sipa · 2023-02-11T15:11:27Z

I've switched to offset $2^{128}$.

src/ecmult_const_impl.h

peterdettman · 2023-02-11T18:38:45Z

I think if _scalar_split_lambda just output 129-bit two's complement values (and then sign-extend to ECMULT_CONST_BITS), then they would be directly usable in the comb, no offset needed for s1, s2. Something for the future, perhaps.

sipa · 2023-10-23T15:29:13Z

Rebased and addressed the comments above.

jonasnick

ACK mod nits c8eb787

I added a couple of (micro-)nit fixups to my branch. Most notably, it tries to unify the first iteration in ecmult_const with the loop and adds an explanation for the index computation in ECMULT_CONST_TABLE_GET_GE.

real-or-random

utACK mod these nits

also utACK on jonas' fixups (except a typo and one further comment here: jonasnick@cdf545a )

I believe I've done a thorough review, but I really have only nits.

stratospher

ACK c8eb787.

was able to follow the useful comments and code! also observed a 2% speedup on my machine.

src/ecmult_const_impl.h

Co-authored-by: Jonas Nick <jonasd.nick@gmail.com>
Co-authored-by: Tim Ruffing <crypto@timruffing.de>

* add test case for a=infinity
  The corresponding ecmult_const branch was not tested before this commit.
* add test for edge cases

jonasnick

ACK 355bbdf

siv2r

ACK 355bbdf

real-or-random

ACK 355bbdf

real-or-random reviewed Dec 30, 2022

sipa force-pushed the 202212_sd_ecmult_const branch 2 times, most recently from a0c7934 to f16c500 on December 30, 2022 22:03

apoelstra approved these changes Jan 7, 2023

sipa force-pushed the 202212_sd_ecmult_const branch from f16c500 to b4ff4eb on January 7, 2023 18:47

apoelstra approved these changes Jan 8, 2023

sipa force-pushed the 202212_sd_ecmult_const branch from b4ff4eb to 4de39ef on January 19, 2023 20:36

sipa force-pushed the 202212_sd_ecmult_const branch from 4de39ef to 1c48861 on January 23, 2023 16:15

peterdettman reviewed Feb 10, 2023
