That's a lot of code. But I think it's possible to do it in fewer than 17 instructions, and with minimal shuffles. I sure hope everything aligns well for vaddsubps.
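For reference, here is a scalar sketch of what vaddsubps computes, which is why operand layout matters: even-indexed elements are subtracted and odd-indexed elements are added. The function name `addsub_model` is made up for illustration. This is a plain C model of the instruction's semantics, not the code from this thread.

```c
#include <stddef.h>

/* Scalar reference model of vaddsubps over n packed floats
   (n = 4 for an xmm register, n = 8 for a ymm register).
   Even-indexed elements: a - b. Odd-indexed elements: a + b.
   addsub_model is a hypothetical name used only for this sketch. */
void addsub_model(const float *a, const float *b, float *dst, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        dst[i] = (i % 2 == 0) ? a[i] - b[i] : a[i] + b[i];
    }
}
```

So any data that should be subtracted has to sit in the even slots before the instruction runs, otherwise an extra shuffle is needed.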
vextractf128 is 1.8% faster than vperm2f128 + movaps.
Oh well, on paper the former was faster, glad I tested it.
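A rough sketch of the two ways to get at the upper 128-bit lane, written with C intrinsics rather than raw assembly. The function names are hypothetical, and the code assumes GCC or Clang (for the `target` attribute) and a CPU with AVX. A compiler is free to pick different instructions than the comments suggest, so this only illustrates the two approaches and does not reproduce the measurement.

```c
#include <immintrin.h>

/* Assumes GCC or Clang: the target attribute enables AVX for these
   functions only. The CPU running the code must support AVX. */

/* Variant A: extract the upper lane directly (maps to vextractf128). */
__attribute__((target("avx")))
void high_lane_extract(const float in[8], float out[4]) {
    __m256 v = _mm256_loadu_ps(in);
    __m128 hi = _mm256_extractf128_ps(v, 1);  /* lane index 1 = upper 128 bits */
    _mm_storeu_ps(out, hi);
}

/* Variant B: permute the upper lane into the lower lane (vperm2f128),
   then treat the low half as a 128-bit register. */
__attribute__((target("avx")))
void high_lane_permute(const float in[8], float out[4]) {
    __m256 v = _mm256_loadu_ps(in);
    /* imm8 = 0x01: low lane of result <- high lane of first operand */
    __m256 swapped = _mm256_permute2f128_ps(v, v, 0x01);
    __m128 hi = _mm256_castps256_ps128(swapped);
    _mm_storeu_ps(out, hi);
}
```

Both variants produce the same four floats; only the instruction sequence differs.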
Right now I dislike how subtraction is non-associative, but my liking for addition being associative makes up for it.