I just wrote an implementation in SSE1 for -[OFMatrix4x4 transformVectors:count:]
in @objfw and I have to say: I fucking hate SSE1!
SSE1 is lacking horizontal add (haddps), something 3DNow! had from the very start (pfacc). Newer SSE versions have it, but on many CPUs it’s slow as hell. This results in requiring a bunch of shufps - a really braindeadly designed instruction. Seriously, you couldn’t do worse.
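To illustrate what "a bunch of shufps" means here, this is a minimal sketch of a horizontal add of four floats using only SSE1 intrinsics. The helper name hsum4 is hypothetical and this is not the code from ObjFW; it only shows the shuffle/add pattern that stands in for a single haddps or pfacc. A scalar fallback is included so the sketch still compiles on targets without SSE.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h> /* SSE1 intrinsics */
#endif

/* Sum the four floats at p[0..3]. Hypothetical helper, not ObjFW code. */
static float
hsum4(const float *p)
{
	assert(p != NULL);
#if defined(__SSE__)
	__m128 v = _mm_loadu_ps(p); /* [a, b, c, d] */
	/* First shufps: swap neighbours within each pair -> [b, a, d, c] */
	__m128 swapped = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));
	__m128 pairs = _mm_add_ps(v, swapped); /* [a+b, a+b, c+d, c+d] */
	/* Second shufps: bring the upper pair down -> [c+d, c+d, a+b, a+b] */
	__m128 high = _mm_shuffle_ps(pairs, pairs, _MM_SHUFFLE(1, 0, 3, 2));
	__m128 total = _mm_add_ss(pairs, high); /* lane 0 = (a+b)+(c+d) */
	return _mm_cvtss_f32(total);
#else
	/* Scalar fallback with the same summation order. */
	return (p[0] + p[1]) + (p[2] + p[3]);
#endif
}
```

That is two shuffles and two adds for a single sum, and a 4x4 matrix times a vector needs four such sums per vector, which gives a rough idea of where the shufps count comes from.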
The end result is that the SSE1 version, littered with shufps, is slower than doing non-vectorized calculations with SSE (just mulss, addss etc.).
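For comparison, a non-vectorized version looks roughly like the sketch below. The Vec4 and Mat4 types, the row-major layout and the function name are assumptions made for illustration, not ObjFW's actual declarations. When SSE is the target's floating-point unit (the default on x86-64), a compiler typically turns plain C like this into scalar mulss/addss instructions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; ObjFW's real types may differ. */
typedef struct { float x, y, z, w; } Vec4;
typedef struct { float m[4][4]; } Mat4; /* assumed row-major */

/* Transform each vector in place: v' = M * v, one scalar op at a time. */
static void
transform_vectors_scalar(const Mat4 *mat, Vec4 *vectors, size_t count)
{
	assert(mat != NULL && (vectors != NULL || count == 0));

	for (size_t i = 0; i < count; i++) {
		/* Copy first so writing results does not clobber inputs. */
		Vec4 v = vectors[i];

		vectors[i].x = mat->m[0][0] * v.x + mat->m[0][1] * v.y +
		    mat->m[0][2] * v.z + mat->m[0][3] * v.w;
		vectors[i].y = mat->m[1][0] * v.x + mat->m[1][1] * v.y +
		    mat->m[1][2] * v.z + mat->m[1][3] * v.w;
		vectors[i].z = mat->m[2][0] * v.x + mat->m[2][1] * v.y +
		    mat->m[2][2] * v.z + mat->m[2][3] * v.w;
		vectors[i].w = mat->m[3][0] * v.x + mat->m[3][1] * v.y +
		    mat->m[3][2] * v.z + mat->m[3][3] * v.w;
	}
}
```

Each output component is a four-term dot product, so there is no shuffling at all: 16 multiplies and 12 adds per vector, all on scalars.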
Intel designed SSE1 specifically for matrix multiplications with floats and I have never seen anything that would be a worse fit for it than SSE1.
So, yeah, I’m not gonna commit that, because the stupid code generated by Clang that isn’t vectorized at all is faster.
(unless you target -m32 -march=i686, but in that case, it would still be better to provide a non-vectorized version then).
But seriously, Intel, how much can you fail? Design an instruction set for one thing (matrix multiplications) and then make sure it sucks as hard as possible for that?
My SSE4.1 version at least is faster than the non-vectorized version, if only by a little.
@robryk @objfw It does better, however it still gets beaten by the non-vectorized version. However, my SSE4.1 version beats both: https://objfw.nil.im/file?ci=trunk&name=src/OFMatrix4x4.m&ln=41-70
@objfw @robryk Oh wow! Got it down to 8 registers by reloading the very last part from memory all the time. That made it even faster! This is by far the fastest of all the various implementations I have had so far: https://objfw.nil.im/info/cf955413ab508784
@robryk @objfw Ouch. That’s because the vector reload slipped out of the loop, meaning the same vector was used all the time! https://objfw.nil.im/info/9ba7594f7b30df9b
But it’s still the fastest version. Which means it really is memory bound. Keeping the entire matrix in registers all the time is actually less useful; it’s better to keep the entire vector in there and leave out the last row.
But I’ll probably just add an #ifdef OF_AMD64 and use the extra register if it’s available - that’s always the best option.
@robryk @objfw Not sure what exactly you mean?