I just wrote an implementation in SSE1 for -[OFMatrix4x4 transformVectors:count:] in @objfw and I have to say: I fucking hate SSE1!

SSE1 lacks a horizontal add (haddps), something 3DNow! had from the very start (pfacc). Newer SSE versions have one, but on many CPUs it’s slow as hell. So instead you need a bunch of shufps - a really braindead instruction design. Seriously, you couldn’t do worse.

The end result is that the SSE1 version, littered with shufps, is slower than doing non-vectorized calculations with SSE (just mulss, addss, etc.).

Intel designed SSE1 specifically for float matrix multiplications, and yet I have never seen an instruction set that is a worse fit for them than SSE1.

So, yeah, I’m not gonna commit that, because the stupid code generated by Clang that isn’t vectorized at all is faster.
(Unless you target -m32 -march=i686 - but even in that case, it would still be better to provide a non-vectorized version.)

But seriously, Intel, how much can you fail? You design an instruction set for one thing (matrix multiplications) and then make sure it sucks as hard as possible at exactly that?

My SSE4.1 version at least is faster than the non-vectorized version, if only by a little.


@js @objfw

What hardware are you comparing on? I would expect the advantage on hardware that supported only SSE1 to be greater than on anything more modern.

@robryk @objfw Zen 4. My retro machine only has 3DNow!, where it makes a decent difference (not the claimed 4x, not even 2x - more like 20%. But still, I’ll take it). Can’t test SSE there, unfortunately. The thing is: Compared to x87 it’s definitely faster. But the non-vectorized SSE version is still faster than the vectorized SSE version. I’m not sure if that would be different on a Pentium 3 - with all that shuffling, it’s just more instructions.

If anybody has a Pentium 3 around, I’ll happily provide a patch and test case.

@js @objfw

Is vectorizing the point-wise multiplications and doing the horizontal additions in a non-vectorized fashion (still using SSE1) still clearly worse than non-vectorizing everything? Or is it impossible, because you can’t pull single elements out of vectors the way you’d need to?

@robryk @objfw That’s essentially what all the shuffles are for. Getting the individual elements out of the vector and adding them all up.

@js @objfw

And doing it via memory (well, l1 cache really) would be even slower?

@robryk @objfw Not sure what you mean. You mean something like this?

movaps %xmm0, (%rax)
addss 4(%rax), %xmm0
addss 8(%rax), %xmm0
addss 12(%rax), %xmm0

(Written on a phone and untested. Since the ss instructions operate on the lowest element, no shuffling should actually be needed here - the sum just accumulates in the low lane of %xmm0.)

Haven’t tried that yet. Can do that later, decided to reboot into Windows and play some games for now ;)

@robryk @objfw It does better, but it still gets beaten by the non-vectorized version. My SSE4.1 version, however, beats both: https://objfw.nil.im/file?ci=trunk&name=src/OFMatrix4x4.m&ln=41-70

@objfw @robryk Oh, now I got an SSE1 version that beats everything, but it needs 9 registers, so it doesn’t work in 32-bit mode. Let’s see which register creates the least overhead when reloaded from memory.

@objfw @robryk Oh wow! Got it down to 8 registers by reloading the very last part from memory every time. That made it even faster! This is by far the fastest of all the implementations I’ve had so far: https://objfw.nil.im/info/cf955413ab508784

@robryk @objfw Ouch. That’s because the vector reload slipped out of the loop, meaning the same vector was used all the time! https://objfw.nil.im/info/9ba7594f7b30df9b

But it’s still the fastest version. Which means it really is memory-bound. Keeping the entire matrix in registers all the time is actually less useful; it’s better to keep the entire vector in there and drop the last matrix row.

But I’ll probably just add an #ifdef OF_AMD64 and use the extra register if it’s available - that’s always the best option.

@objfw @robryk And this brings it basically back to the same speed as that of the code that was missing the vector reload: https://objfw.nil.im/info/5edf0d083d8cbfc4

So, the larger your register set, the better. As always, nothing is slower than accessing memory. Except for that scratch pad - which is probably kept in L1 the entire time.

Now it makes me curious if in some register constrained situations mixing SSE and 3DNow! would have made sense :flan_XD:.

@js @objfw

Was there no penalty incurred by using one after the other? (I expect not, but I wouldn’t be very surprised if CPU designers somehow managed to reuse parts of the logic units between the two.)
