Satisfying work for the day. One of the items in the Vello pipeline where profiling revealed performance could be improved was computing the bounding boxes of paths. I was using atomic min/max for each path segment to compute bbox union, and that was expensive (1.3ms for paris-30k on M1 Max).
I figured out a way to do it with monoids instead of atomics, and it's now 400µs. The trick is "segmented reduction" which works amazingly well.
I might blog this, but not sure yet; other things to do.
@raph This is doing the reduction with min/max in a minimal-depth tree instead of starting from one end?
Doesn't sound too hairy, even though you called the project Vello.
@radehi This is not the tree stuff (which is hairy). Here we have a sequence of paths, and each path is a sequence of segments. For each path you want the bbox, which is the union of the bboxes of its segments.
The trick is to concatenate all segments into one big vector, then have a "tag stream" (1 byte/seg) that has a bit indicating path end. Compute bbox using a monoid scan, then write when the path end bit is set (stream compaction).
And there's another fun detail that makes the scan fast.
@raph You mean, if you're doing prefix sum with addition, you have to add the base for the workgroup to all the 256 or however many items are in the workgroup, but in this case the min/max from the previous workgroup only propagates up to the first path end, and ultimately you only care about the values at path ends?
@radehi Precisely that. The fact that you're doing stream compaction after the raw monoid scan saves you a huge amount of work in this case.