Long time, no SIMD #Haskell
JK, the naivest, stupidest binary tree blows the fancy 4-way SIMDed BVH out of the water.
Actually, it may be slower by itself, but multicore apparently *destroys* wide-instruction performance. At the same time, cache^W completely oblivious scalar traversal is happy to run on all the capabilities available.
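For the curious, a rough sketch of what "naive binary tree, scalar traversal, spread over all capabilities" could look like. This is not the actual code behind the numbers above: the types (`V3`, `Box`, `BVH`, `Ray`), the sphere primitive, and the helpers `closest` and `traceAll` are all made up for illustration, and the parallelism is plain `forkIO` with one chunk of rays per capability.

```haskell
module Main where

import Control.Concurrent (forkIO, getNumCapabilities)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forM)

-- All names below are hypothetical, chosen for this sketch only.
data V3  = V3 !Double !Double !Double
data Box = Box !V3 !V3                 -- min corner, max corner
data Ray = Ray !V3 !V3                 -- origin, direction

-- Plain binary BVH; each leaf holds a single sphere.
data BVH
  = Leaf !Box !V3 !Double              -- bounds, sphere centre, radius
  | Node !Box BVH BVH

-- Scalar slab test. A zero direction component yields +/-Infinity, which
-- min/max handle; the NaN case (origin exactly on a slab plane) is ignored.
hitBox :: Ray -> Box -> Bool
hitBox (Ray (V3 ox oy oz) (V3 dx dy dz)) (Box (V3 lx ly lz) (V3 hx hy hz)) =
  tmax >= max 0 tmin
  where
    slab o d l h = let a = (l - o) / d
                       b = (h - o) / d
                   in (min a b, max a b)
    (x0, x1) = slab ox dx lx hx
    (y0, y1) = slab oy dy ly hy
    (z0, z1) = slab oz dz lz hz
    tmin = maximum [x0, y0, z0]
    tmax = minimum [x1, y1, z1]

-- Nearest hit in front of the origin; rays starting inside a sphere are ignored.
hitSphere :: Ray -> V3 -> Double -> Maybe Double
hitSphere (Ray (V3 ox oy oz) (V3 dx dy dz)) (V3 cx cy cz) r
  | disc < 0  = Nothing
  | t > 0     = Just t
  | otherwise = Nothing
  where
    (px, py, pz) = (ox - cx, oy - cy, oz - cz)
    a    = dx * dx + dy * dy + dz * dz
    b    = 2 * (px * dx + py * dy + pz * dz)
    c    = px * px + py * py + pz * pz - r * r
    disc = b * b - 4 * a * c
    t    = (negate b - sqrt disc) / (2 * a)

-- The "stupid" traversal: recurse into both children, keep the nearer hit.
closest :: BVH -> Ray -> Maybe Double
closest (Leaf box c r) ray
  | hitBox ray box = hitSphere ray c r
  | otherwise      = Nothing
closest (Node box l r) ray
  | hitBox ray box = nearer (closest l ray) (closest r ray)
  | otherwise      = Nothing
  where
    nearer Nothing  y        = y
    nearer x        Nothing  = x
    nearer (Just x) (Just y) = Just (min x y)

-- One chunk of rays per capability, each traced in its own thread.
traceAll :: BVH -> [Ray] -> IO [Maybe Double]
traceAll bvh rays = do
  n <- getNumCapabilities
  let size = max 1 ((length rays + n - 1) `div` n)
  vars <- forM (chunk size rays) $ \rs -> do
    v <- newEmptyMVar
    _ <- forkIO $ do
      let hits = map (closest bvh) rs
      -- Force every distance inside the worker, not in the consumer.
      foldr (\m acc -> maybe () (`seq` ()) m `seq` acc) () hits
        `seq` putMVar v hits
    return v
  concat <$> mapM takeMVar vars
  where
    chunk _ [] = []
    chunk k xs = let (a, b) = splitAt k xs in a : chunk k b

scene :: BVH
scene = Node (Box (V3 (-1) (-1) 4) (V3 4 1 6))
  (Leaf (Box (V3 (-1) (-1) 4) (V3 1 1 6)) (V3 0 0 5) 1)
  (Leaf (Box (V3 2 (-1) 4) (V3 4 1 6)) (V3 3 0 5) 1)

main :: IO ()
main = do
  hits <- traceAll scene [Ray (V3 x 0 0) (V3 0 0 1) | x <- [0, 1.5, 3]]
  print hits
```

To actually use more than one capability this has to be built with `-threaded` and run with `+RTS -N`. No packing, no lanes, no alignment: every ray walks the tree on its own, which is exactly why it scales across cores without ceremony.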
High-ceremony 4-wide or primitive 10-/20-/whatever-wide?
A bit fixed and pampered up, rawr.