in Brainfuck, loops that look like [>>>] are common. these are memory scans: this one sweeps memory in the positive direction, stopping when a zero-valued cell is found at a multiple of 3 from the starting position.
if I want this to go fast on Neon, I can load 16 bytes into a vector register, test them all for equality with zero, and then what? I guess add up the resulting vector and check if the sum is zero, and if it isn't, scan the elements one by one? is there a faster way to do this?
I'm not sure if that's anyhow fast, but you could perform the comparison (ending with an all-ones byte where it succeeded and all-zeroes elsewhere), AND it with a vector that has (2^i)-i in i-th position, OR it up, add 1, and use any of the scalar ways of figuring out which power of two that is.
(Alternatively, have the second vector have 2^i in i-th position, AND it with first, add them up and use a scalar way to figure out which is the highest bit set.
@robryk but excited to get my class working on this sort of thing, we'll see who creates fast code. I don't want to present them with a finished solution, that's boring!
@robryk yeah I have no idea!