This truly is the final boss. 16-point FFT with a hard instruction target.

Except I'm also doing a 32-point and the world's weirdest 64-point afterwards, but the former doesn't seem hard, and the latter sounds like a heckin' fun time!

That's a lot of code. But I think it's possible to do in fewer than 17 instructions, and with minimal shuffles. I sure hope everything aligns well for vaddsubps.
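For anyone wondering why vaddsubps matters here, a rough scalar model: it subtracts in the even lanes and adds in the odd ones, which is exactly the shape of z + i*w on interleaved (re, im) pairs. The names `addsubps` and `swap_pairs` are made up for this sketch, and the interleaved layout is an assumption, not necessarily how the real FFT code lays out its data.

```python
def addsubps(a, b):
    """Scalar model of vaddsubps: a - b in even lanes, a + b in odd lanes."""
    assert len(a) == len(b)
    return [x - y if i % 2 == 0 else x + y
            for i, (x, y) in enumerate(zip(a, b))]


def swap_pairs(v):
    """Hypothetical helper: swap re/im inside each interleaved pair (one shuffle)."""
    return [v[i ^ 1] for i in range(len(v))]


# Interleaved complex values: z = (1+2i, 3+4i), w = (5+6i, 7+8i).
z = [1.0, 2.0, 3.0, 4.0]
w = [5.0, 6.0, 7.0, 8.0]

# z + i*w = (z.re - w.im, z.im + w.re): one swap plus one addsubps.
print(addsubps(z, swap_pairs(w)))  # [-5.0, 7.0, -5.0, 11.0]
```

If the data does not line up like this, an extra shuffle or a sign flip is needed before the add, which is where the instruction count starts to creep up.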

vextractf128 is 1.8% faster than vperm2f128 + movaps.
Oh well, on paper the former was faster, glad I tested it.
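A small model of why the two sequences are interchangeable, assuming the goal was to pull the upper 128-bit half out of a 256-bit register (the exact operands used in the real code are not stated, so the immediates below are only one plausible choice). Registers are modelled as lists of 8 floats, and the Python functions just borrow the instruction names.

```python
def vextractf128(src, imm):
    """Model: return the low (imm=0) or high (imm=1) 4-float half of src."""
    return src[4 * imm:4 * imm + 4]


def vperm2f128(a, b, imm):
    """Model: each result half is picked from a/b by a 2-bit selector.

    Selector values: 0 = a low, 1 = a high, 2 = b low, 3 = b high.
    Bit 3 zeroes the low result half, bit 7 zeroes the high one.
    """
    halves = [a[:4], a[4:], b[:4], b[4:]]
    lo = [0.0] * 4 if imm & 0x08 else halves[imm & 3]
    hi = [0.0] * 4 if imm & 0x80 else halves[(imm >> 4) & 3]
    return lo + hi


y = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]

via_extract = vextractf128(y, 1)
# The [:4] stands in for the movaps that copies out the low 128 bits.
via_perm = vperm2f128(y, y, 0x11)[:4]

print(via_extract == via_perm)  # True
```

Same result either way, so the only thing left to compare is speed, which is what the benchmark settles.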

Right now I dislike how subtraction is non-associative, but my liking for addition being associative makes up for it.
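A quick sanity check of that complaint, using small integers so rounding does not get in the way. The values are arbitrary and only there for illustration.

```python
a, b, c = 8, 4, 2

# Subtraction: regrouping changes the answer.
sub_left = (a - b) - c    # 2
sub_right = a - (b - c)   # 6

# Addition: regrouping is free.
add_left = (a + b) + c    # 14
add_right = a + (b + c)   # 14

# Rewriting a - b - c as a + (-b) + (-c) makes the regrouping legal again.
neg_regrouped = a + ((-b) + (-c))  # 2

print(sub_left, sub_right, add_left, add_right, neg_regrouped)  # 2 6 14 14 2
```

So whenever a chain of subtractions needs reordering, the signs have to be carried along explicitly, which in SIMD terms tends to mean either a negated operand or an instruction like vaddsubps.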

@emilis @lynne

If 1*n is an idempotent operation, -1*n is "contrapotent".

