@robryk @objfw It does better, but it still gets beaten by the non-vectorized version. My SSE4.1 version beats both, though: https://objfw.nil.im/file?ci=trunk&name=src/OFMatrix4x4.m&ln=41-70
@objfw @robryk Oh wow! Got it down to 8 registers by reloading the very last part from memory all the time. That made it even faster! This is by far the fastest of all the implementations I've tried so far: https://objfw.nil.im/info/cf955413ab508784
@robryk @objfw Ouch. That’s because the vector reload slipped out of the loop, meaning the same vector was used all the time! https://objfw.nil.im/info/9ba7594f7b30df9b
But it’s still the fastest version, which means it really is memory bound. Keeping the entire matrix in registers all the time is actually less useful; it’s better to keep the entire vector in there and leave the last row of the matrix in memory.
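For anyone following along, here is a rough sketch of that loop shape in plain scalar C. It is not the SSE4.1 code from the links; `Vec4`, `Mat4`, `dot` and `transformVectors` are made-up names for illustration. The first three matrix rows are copied into locals once, the last row is read from memory on every iteration, and the vector is loaded inside the loop so each iteration sees a fresh one. A C compiler is free to allocate registers however it likes, so this only shows the structure, not the actual register usage.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types for illustration; not ObjFW's actual API. */
typedef struct { float x, y, z, w; } Vec4;
typedef struct { float row[4][4]; } Mat4;	/* row-major */

/* Dot product of one matrix row with a vector. */
static float
dot(const float r[4], Vec4 v)
{
	return r[0] * v.x + r[1] * v.y + r[2] * v.z + r[3] * v.w;
}

/* Transforms `count` vectors in place by `matrix`. */
static void
transformVectors(const Mat4 *matrix, Vec4 *vectors, size_t count)
{
	assert(matrix != NULL && (count == 0 || vectors != NULL));

	/* Loop-invariant: rows 0-2 are copied into locals once. */
	float r0[4], r1[4], r2[4];
	for (size_t j = 0; j < 4; j++) {
		r0[j] = matrix->row[0][j];
		r1[j] = matrix->row[1][j];
		r2[j] = matrix->row[2][j];
	}

	for (size_t i = 0; i < count; i++) {
		/* Per-iteration load: this must stay inside the loop. */
		const Vec4 v = vectors[i];
		Vec4 out;

		out.x = dot(r0, v);
		out.y = dot(r1, v);
		out.z = dot(r2, v);
		/* Row 3 is deliberately read from memory every time. */
		out.w = dot(matrix->row[3], v);

		vectors[i] = out;
	}
}
```

Hoisting the `vectors[i]` load above the loop would reproduce the bug described above: every iteration would transform the same vector.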
But I’ll probably just add an #ifdef OF_AMD64 and use the extra register if it’s available - that’s always the best option.
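A minimal sketch of what such a switch could look like, assuming `OF_AMD64` is defined by ObjFW's headers on x86_64 (the compiler macros are only a fallback so the snippet builds on its own). `MATRIX_ROWS_IN_REGISTERS` and `matrixRowsInRegisters` are hypothetical names, not anything from ObjFW. The idea is that x86_64 has 16 XMM registers instead of 8, so all four rows can stay resident there.

```c
#include <assert.h>

/*
 * OF_AMD64 is the macro named in the post; the compiler-provided macros
 * are an assumption added here so the sketch works outside ObjFW.
 */
#if defined(OF_AMD64) || defined(__x86_64__) || defined(_M_X64)
# define MATRIX_ROWS_IN_REGISTERS 4	/* 16 XMM registers: keep all rows */
#else
# define MATRIX_ROWS_IN_REGISTERS 3	/* 8 XMM registers: reload last row */
#endif

static_assert(MATRIX_ROWS_IN_REGISTERS <= 4, "a 4x4 matrix has four rows");

/* Hypothetical helper: how many matrix rows the loop keeps resident. */
static int
matrixRowsInRegisters(void)
{
	return MATRIX_ROWS_IN_REGISTERS;
}
```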
@objfw @robryk And this brings it basically back to the same speed as that of the code that was missing the vector reload: https://objfw.nil.im/info/5edf0d083d8cbfc4
So, the larger your register set, the better. As always, nothing is slower than accessing memory. Except for that scratch pad - which is probably kept in L1 the entire time.
Now it makes me curious whether mixing SSE and 3DNow! would have made sense in some register-constrained situations.
If anybody has a Pentium 3 around, I’ll happily provide a patch and test case.