I'm not sure if that's anyhow fast, but you could perform the comparison (ending with an all-ones byte where it succeeded and all-zeroes elsewhere), AND it with a vector that has (2^i)-i in i-th position, OR it up, add 1, and use any of the scalar ways of figuring out which power of two that is.
(Alternatively, have the second vector have 2^i in i-th position, AND it with first, add them up and use a scalar way to figure out which is the highest bit set.
My main worry would be whether there isn't something tricky around horizontal operations (e.g. that ones that cross some boundary are much slower, or that only a few of them are passably fast).
@robryk yeah I have no idea!
@robryk but excited to get my class working on this sort of thing, we'll see who creates fast code. I don't want to present them with a finished solution, that's boring!
@robryk interesting. I'll float a scheme like this to my class and see if I can nerdsnipe some of them :).