robryk: "@js@nil.im @objfw@nil.im Was there no penalty i…"

What hardware are you comparing on? I would expect the advantage on hardware that supported only sse1 to be greater than on anything more modern.

js @js@nil.im · Nov 05, 2023, 21:17

@robryk @objfw Zen 4. My retro machine only has 3DNow!, where it makes a decent difference (not the claimed 4x, not even 2x - more like 20%. But still, I’ll take it). Can’t test SSE there, unfortunately. The thing is: Compared to x87 it’s definitely faster. But the non-vectorized SSE version is still faster than the vectorized SSE version. I’m not sure if that would be different on a Pentium 3 - with all that shuffling, it’s just more instructions.

If anybody has a Pentium 3 around, I’ll happily provide a patch and test case.

**robryk** @robryk@qoto.org · Nov 05, 2023, 21:22

Is vectorizing point-wise multiplications, and doing horizontal additions in a nonvectorized fashion (still using sse1) still obviously worse than nonvectorized everything (or impossible, because you can't pull out single elements of vectors in the way you'd need to)?

js @js@nil.im · Nov 05, 2023, 21:23

@robryk @objfw That’s essentially what all the shuffles are for. Getting the individual elements out of the vector and adding them all up.
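As a rough illustration of the shuffle-and-add reduction described here, the following is a minimal C sketch using SSE1 intrinsics. The function name `hsum_shuffle` and the particular shuffle sequence are assumptions made for illustration, not the code from ObjFW. The scalar fallback exists only so the snippet also compiles on non-x86 targets.

```c
#include <assert.h>

#if defined(__SSE__)
#include <xmmintrin.h>

/* Horizontal sum of four floats using only SSE1 shuffles and adds.
 * Hypothetical helper, not taken from ObjFW. */
float hsum_shuffle(const float v[4])
{
    __m128 x = _mm_loadu_ps(v);         /* x    = [v0, v1, v2, v3]         */
    __m128 hi = _mm_movehl_ps(x, x);    /* hi   = [v2, v3, v2, v3]         */
    __m128 sum2 = _mm_add_ps(x, hi);    /* sum2 = [v0+v2, v1+v3, ...]      */
    /* Broadcast lane 1 (v1+v3) so it can be added to lane 0. */
    __m128 odd = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
    __m128 sum = _mm_add_ss(sum2, odd); /* lane 0 = (v0+v2) + (v1+v3)      */
    return _mm_cvtss_f32(sum);          /* extract lane 0                  */
}
#else
/* Portable fallback with the same summation order. */
float hsum_shuffle(const float v[4])
{
    return (v[0] + v[2]) + (v[1] + v[3]);
}
#endif

/* Small usage example: 1 + 2 + 3 + 4 is exactly representable. */
void hsum_shuffle_example(void)
{
    const float v[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    assert(hsum_shuffle(v) == 10.0f);
}
```

Even in this compact form, the reduction costs two shuffles and two adds per dot product on top of the multiply, which is the overhead the thread is discussing.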

**robryk** @robryk@qoto.org · Nov 05, 2023, 21:24

And doing it via memory (well, l1 cache really) would be even slower?

js @js@nil.im · Nov 05, 2023, 21:34

@robryk @objfw Not sure what you mean. You mean something like this?

    movaps %xmm0, (%rax)
    shufps $0xC0, %xmm0, %xmm0
    addss 4(%rax), %xmm0
    addss 8(%rax), %xmm0
    addss 12(%rax), %xmm0

(Written on a phone and untested - probably got the shufps immediate wrong. The weird thing is that the ss instructions don’t work on the lowest, but on the highest element. That’s why you need the shuffling around.)

Haven’t tried that yet. Can do that later, decided to reboot into Windows and play some games for now ;)
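For readers who prefer intrinsics, the spill-to-memory idea can be sketched in C roughly as follows. The names `hsum_via_memory` and `scratch` are hypothetical, and an optimizing compiler is free to turn the store and reloads into something else entirely, so this only illustrates the data flow. Note that in this sketch `_mm_add_ss` (`addss`) adds into lane 0, the lowest 32 bits of the register, so elements 1 to 3 can be added from memory without a shuffle first.

```c
#include <assert.h>

#if defined(__SSE__)
#include <xmmintrin.h>

/* Horizontal sum via a 16-byte aligned scratch pad.
 * Hypothetical helper, not taken from ObjFW. */
float hsum_via_memory(const float v[4])
{
    _Alignas(16) float scratch[4];      /* stands in for (%rax)        */
    __m128 x = _mm_loadu_ps(v);         /* x = [v0, v1, v2, v3]        */

    _mm_store_ps(scratch, x);           /* movaps %xmm0, (%rax)        */
    /* Each scalar add only changes lane 0, which starts out as v0. */
    x = _mm_add_ss(x, _mm_load_ss(&scratch[1]));  /* addss  4(%rax)    */
    x = _mm_add_ss(x, _mm_load_ss(&scratch[2]));  /* addss  8(%rax)    */
    x = _mm_add_ss(x, _mm_load_ss(&scratch[3]));  /* addss 12(%rax)    */
    return _mm_cvtss_f32(x);            /* extract lane 0              */
}
#else
/* Portable fallback with the same summation order. */
float hsum_via_memory(const float v[4])
{
    return ((v[0] + v[1]) + v[2]) + v[3];
}
#endif

/* Small usage example: 1 + 2 + 3 + 4 is exactly representable. */
void hsum_via_memory_example(void)
{
    const float v[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    assert(hsum_via_memory(v) == 10.0f);
}
```

The three scalar adds form a serial dependency chain, and each one waits on a load of data that was just stored, so whether this beats the shuffle variant depends on the CPU's store-forwarding behavior.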

js @js@nil.im · Nov 06, 2023, 00:45

@robryk @objfw It does better, however it still gets beaten by the non-vectorized version. However, my SSE4.1 version beats both: https://objfw.nil.im/file?ci=trunk&name=src/OFMatrix4x4.m&ln=41-70

js @js@nil.im · Nov 06, 2023, 00:49

@objfw @robryk Oh, now I got an SSE1 version that beats everything, but needs 9 registers, so doesn’t work in 32 bit. Let’s see which register creates the least overhead when reloading it.

js @js@nil.im · Nov 06, 2023, 01:01

@objfw @robryk Oh wow! Got it down to 8 registers by reloading the very last part from memory all the time. That made it even faster! This is by far the fastest of all the various implementations I have had so far: https://objfw.nil.im/info/cf955413ab508784

js @js@nil.im · Nov 06, 2023, 20:14

@robryk @objfw Ouch. That’s because the vector reload slipped out of the loop, meaning the same vector was used all the time! https://objfw.nil.im/info/9ba7594f7b30df9b

But it’s still the fastest version. Which means it really is memory bound. Keeping the entire matrix in registers all the time is actually less useful, better to keep the entire vector in there and miss the last row.

But I’ll probably just add an #ifdef OF_AMD64 and use the extra register if it’s available - that’s always the best option.

js @js@nil.im · Nov 06, 2023, 20:19

@objfw @robryk And this brings it basically back to the same speed as that of the code that was missing the vector reload: https://objfw.nil.im/info/5edf0d083d8cbfc4

So, the larger your register set, the better. As always, nothing is slower than accessing memory. Except for that scratch pad - which is probably kept in L1 the entire time.

Now it makes me curious whether mixing SSE and 3DNow! would have made sense in some register-constrained situations.

**robryk** @robryk@qoto.org · Nov 07, 2023, 23:29

Was there no penalty incurred by using one after the other? (I expect not, but I wouldn't be very surprised if CPU designers had somehow managed to reuse parts of the logic units of one for the other.)

js @js@nil.im · Nov 09, 2023, 20:32
