in Brainfuck, loops that look like [>>>] are common. these are memory scans: this one sweeps memory in the positive direction, stopping when a zero-valued cell is found at a multiple of 3 from the starting position.

if I want this to go fast on Neon, I can load 16 bytes into a vector register, test them all for equality with zero, and then what? I guess add up the resulting vector and check if the sum is zero, and if it isn't, scan the elements one by one? is there a faster way to do this?

Follow

@regehr

I'm not sure if that's anyhow fast, but you could perform the comparison (ending with an all-ones byte where it succeeded and all-zeroes elsewhere), AND it with a vector that has (2^i)-i in i-th position, OR it up, add 1, and use any of the scalar ways of figuring out which power of two that is.

(Alternatively, have the second vector have 2^i in i-th position, AND it with first, add them up and use a scalar way to figure out which is the highest bit set.

@robryk interesting. I'll float a scheme like this to my class and see if I can nerdsnipe some of them :).

@regehr

My main worry would be whether there isn't something tricky around horizontal operations (e.g. that ones that cross some boundary are much slower, or that only a few of them are passably fast).

@robryk but excited to get my class working on this sort of thing, we'll see who creates fast code. I don't want to present them with a finished solution, that's boring!

Sign in to participate in the conversation
Qoto Mastodon

QOTO: Question Others to Teach Ourselves
An inclusive, Academic Freedom, instance
All cultures welcome.
Hate speech and harassment strictly forbidden.