Satisfying work for the day. One of the items in the Vello pipeline where profiling revealed performance could be improved was computing the bounding boxes of paths. I was using atomic min/max for each path segment to compute bbox union, and that was expensive (1.3ms for paris-30k on M1 Max).

I figured out a way to do it with monoids instead of atomics, and it's now 400µs. The trick is "segmented reduction" which works amazingly well.

I might blog this, but not sure yet; other things to do.

@raph This is doing the reduction with min/max in a minimal-depth tree instead of starting from one end?

Doesn't sound too hairy, even though you called the project Vello.

@radehi This is not the tree stuff (which is hairy). Here we have a sequence of paths, and each path is a sequence of segments. For each path you want the bbox, which is the union of the bboxes of its segments.

The trick is to concatenate all segments into one big vector, then have a "tag stream" (1 byte/seg) that has a bit indicating path end. Compute bbox using a monoid scan, then write when the path end bit is set (stream compaction).

And there's another fun detail that makes the scan fast.

@raph Hmm, but by itself monoid scan (APL scan, right? prefix sum) will give you maxes or mins from before the previous path end. I guess you need a weird max (or min) that takes the path-end bit into account, so that max(End(3), NotEnd(2)) is NotEnd(2), and then any standard prefix sum algorithm gives you the right answer?

@radehi That's right, the monoid is not just min/max, but also incorporates the path end bit. That trick is explained in section 1.5 of Blelloch's 1993 "Prefix sums and their applications": cs.cmu.edu/~guyb/papers/Ble93.

Follow

@raph Neat, I never read the whole paper but I might have read that part.

Sign in to participate in the conversation
Qoto Mastodon

QOTO: Question Others to Teach Ourselves
An inclusive, Academic Freedom, instance
All cultures welcome.
Hate speech and harassment strictly forbidden.