Satisfying work for the day. One of the items in the Vello pipeline where profiling revealed performance could be improved was computing the bounding boxes of paths. I was using atomic min/max for each path segment to compute bbox union, and that was expensive (1.3ms for paris-30k on M1 Max).
I figured out a way to do it with monoids instead of atomics, and it's now 400µs. The trick is "segmented reduction" which works amazingly well.
I might blog this, but not sure yet; other things to do.
@raph This is doing the reduction with min/max in a minimal-depth tree instead of starting from one end?
Doesn't sound too hairy, even though you called the project Vello.
@raph Hmm, but by itself monoid scan (APL scan, right? prefix sum) will give you maxes or mins from before the previous path end. I guess you need a weird max (or min) that takes the path-end bit into account, so that max(End(3), NotEnd(2)) is NotEnd(2), and then any standard prefix sum algorithm gives you the right answer?
@raph Neat, I never read the whole paper but I might have read that part.
@raph Parallel prefix sum with non-commutative monoids is underappreciated.
@radehi That's right, the monoid is not just min/max, but also incorporates the path end bit. That trick is explained in section 1.5 of Blelloch's 1993 "Prefix sums and their applications": https://www.cs.cmu.edu/~guyb/papers/Ble93.pdf