@raph Hmm, but by itself monoid scan (APL scan, right? prefix sum) will give you maxes or mins from before the previous path end. I guess you need a weird max (or min) that takes the path-end bit into account, so that max(End(3), NotEnd(2)) is NotEnd(2), and then any standard prefix sum algorithm gives you the right answer?
@radehi That's right, the monoid is not just min/max, but also incorporates the path end bit. That trick is explained in section 1.5 of Blelloch's 1993 "Prefix sums and their applications": https://www.cs.cmu.edu/~guyb/papers/Ble93.pdf
@raph Neat, I never read the whole paper but I might have read that part.
@raph Parallel prefix sum with non-commutative monoids is underappreciated.
@radehi That detail is that you can do almost all the work intra-workgroup. There's only one item per wg that needs fixup after, so the number of threads in that dispatch is only 1/256 of the main work. And this works whether the # of segments in a path is large, small, or some mixture.
That's not something you can do for general monoid scan, but it works for this. It feels like magic, and it looks *great* when actually profiled.
@raph You mean, if you're doing prefix sum with addition, you have to add the base for the workgroup to all the 256 or however many items are in the workgroup, but in this case the min/max from the previous workgroup only propagates up to the first path end, and ultimately you only care about the values at path ends?
@radehi Precisely that. The fact that you're doing stream compaction after the raw monoid scan saves you a huge amount of work in this case.
@radehi This is not the tree stuff (which is hairy). Here we have a sequence of paths, and each path is a sequence of segments. For each path you want the bbox, which is the union of the bboxes of its segments.
The trick is to concatenate all segments into one big vector, then have a "tag stream" (1 byte/seg) that has a bit indicating path end. Compute bbox using a monoid scan, then write when the path end bit is set (stream compaction).
And there's another fun detail that makes the scan fast.