Just consider the generated code for the two foo variants:
https://godbolt.org/z/P7TMe4vax
The code using double interleaves the multiplications for the first and second summand to achieve more instruction-level parallelism (ILP). The code using std::simd doesn't do this.
IMHO the reason is the use of intrinsics in the std::simd implementation, which prevents such optimizations. This can noticeably reduce the speedup you get from vectorization.
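To make the pattern concrete, here is a minimal sketch of the kind of function I mean. It is not the code behind the link above: the body of foo and the v2d type are assumptions made up for illustration. v2d uses the GCC/Clang vector extension instead of std::simd, so its operators are ordinary vector operations rather than intrinsic calls. Whether the compiler then interleaves the two chains for v2d would have to be checked in the generated assembly.

```cpp
// Hypothetical 2-lane vector type (GCC/Clang vector extension), used here
// only as a stand-in for a SIMD type whose operators are not intrinsics.
typedef double v2d __attribute__((vector_size(16)));

// Hypothetical stand-in for foo: two summands, each a dependent chain of
// multiplications. The two chains are independent of each other, so a
// compiler is free to interleave their multiplications for more ILP.
template <class T>
T foo(T a, T b, T c, T d) {
    T first = a * b * a * b;   // chain 1
    T second = c * d * c * d;  // chain 2, independent of chain 1
    return first + second;
}
```

Instantiating foo with double and with v2d gives a scalar and a vector variant of the same source, which makes it easy to compare the instruction scheduling of both side by side.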
I have no good solution for this issue yet; I just want to raise awareness here.
Maybe some annotation for intrinsics needs to be introduced that tells the compiler that the annotated intrinsic may be optimized.