**Raph Levien** @raph@mastodon.online · Jan 11, 2023, 04:55

Satisfying work for the day. One of the items in the Vello pipeline where profiling revealed performance could be improved was computing the bounding boxes of paths. I was using atomic min/max for each path segment to compute bbox union, and that was expensive (1.3ms for paris-30k on M1 Max).

I figured out a way to do it with monoids instead of atomics, and it's now 400µs. The trick is "segmented reduction" which works amazingly well.

I might blog this, but not sure yet; other things to do.

**R. A. Dehi** @radehi@qoto.org · Jan 11, 2023, 05:02

@raph This is doing the reduction with min/max in a minimal-depth tree instead of starting from one end?

Doesn't sound too hairy, even though you called the project Vello.

**Raph Levien** @raph@mastodon.online · Jan 11, 2023, 05:07

@radehi This is not the tree stuff (which is hairy). Here we have a sequence of paths, and each path is a sequence of segments. For each path you want the bbox, which is the union of the bboxes of its segments.

The trick is to concatenate all segments into one big vector, then have a "tag stream" (1 byte/seg) that has a bit indicating path end. Compute bbox using a monoid scan, then write when the path end bit is set (stream compaction).

And there's another fun detail that makes the scan fast.
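The scheme in this post can be sketched sequentially in Python. This is an illustration under stated assumptions, not Vello's shader code: the names `union`, `combine`, and `path_bboxes` are hypothetical, the path-end tag bit is modeled as a list of booleans, and `itertools.accumulate` stands in for the parallel scan. The sketch turns the path-end bits into "path start" flags so that the scan element is a simple (flag, bbox) pair with an associative combine.

```python
from itertools import accumulate

INF = float("inf")
EMPTY = (INF, INF, -INF, -INF)   # identity for bbox union: (x0, y0, x1, y1)
IDENTITY = (False, EMPTY)        # identity for the segmented monoid


def union(a, b):
    """Union of two axis-aligned bounding boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def combine(a, b):
    """Associative combine on (starts_path, bbox) pairs.

    If the right operand contains a path start, the left bbox belongs to an
    earlier path and is discarded; otherwise the two bboxes are unioned.
    """
    a_head, a_box = a
    b_head, b_box = b
    return (a_head or b_head, b_box if b_head else union(a_box, b_box))


def path_bboxes(seg_bboxes, path_end):
    """One bbox per path, given all segments concatenated plus a tag stream."""
    n = len(seg_bboxes)
    # A segment starts a path if it is first overall or follows a path end.
    heads = [i == 0 or path_end[i - 1] for i in range(n)]
    # Inclusive segmented scan (sequential stand-in for the parallel scan).
    scanned = list(accumulate(zip(heads, seg_bboxes), combine))
    # Stream compaction: keep the scanned value only where the end bit is set.
    return [box for (_, box), end in zip(scanned, path_end) if end]


# Two paths: segments 0-1 and segments 2-4.
segs = [(0, 0, 1, 1), (1, 1, 3, 2), (5, 5, 6, 6), (4, 7, 5, 8), (2, 2, 2, 9)]
ends = [False, True, False, False, True]
result = path_bboxes(segs, ends)  # [(0, 0, 3, 2), (2, 2, 6, 9)]
```

Because `combine` is associative and has an identity, the scan can be evaluated in any grouping, which is what lets a GPU implementation split it across threads.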

**Raph Levien** @raph@mastodon.online · Jan 11, 2023, 05:11

@radehi That detail is that you can do almost all the work intra-workgroup. There's only one item per wg that needs fixup after, so the number of threads in that dispatch is only 1/256 of the main work. And this works whether the # of segments in a path is large, small, or some mixture.

That's not something you can do for general monoid scan, but it works for this. It feels like magic, and it looks *great* when actually profiled.

**R. A. Dehi** @radehi@qoto.org · Jan 11, 2023, 05:18

@raph You mean, if you're doing prefix sum with addition, you have to add the base for the workgroup to all the 256 or however many items are in the workgroup, but in this case the min/max from the previous workgroup only propagates up to the first path end, and ultimately you only care about the values at path ends?

**Raph Levien** @raph@mastodon.online · Jan 11, 2023, 05:20

@radehi Precisely that. The fact that you're doing stream compaction after the raw monoid scan saves you a huge amount of work in this case.
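The exchange above can be illustrated with a sequential Python sketch of the two-pass structure. It is a guess at the shape of the idea, not Vello's actual code: `path_bboxes_workgroups`, the `wg` parameter, and the running `path_ix` counter are hypothetical (on a GPU the output slot would presumably come from a prefix count of path-end bits, and the carry between workgroups would itself be a small scan). Pass 1 does all the per-segment work inside each workgroup; pass 2 touches only one output per workgroup, namely the first path that ends there, since that is the only one the previous workgroups can contribute to.

```python
INF = float("inf")
EMPTY = (INF, INF, -INF, -INF)   # identity for bbox union: (x0, y0, x1, y1)


def union(a, b):
    """Union of two axis-aligned bounding boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def path_bboxes_workgroups(seg_bboxes, path_end, wg=256):
    n = len(seg_bboxes)
    out = [EMPTY] * sum(path_end)  # one compacted slot per path end
    tails = []    # per workgroup: bbox of segments after its last path end
    firsts = []   # per workgroup: output slot of its first path end, or None
    path_ix = 0   # sequential stand-in for a prefix count of path-end bits

    # Pass 1: work local to each workgroup, with no cross-workgroup input.
    for start in range(0, n, wg):
        box, first = EMPTY, None
        for i in range(start, min(start + wg, n)):
            box = union(box, seg_bboxes[i])
            if path_end[i]:
                out[path_ix] = box       # compacted write at a path end
                if first is None:
                    first = path_ix      # only this slot may need fixup
                path_ix += 1
                box = EMPTY              # next segment starts a new path
        tails.append(box)
        firsts.append(first)

    # Pass 2: fixup, one item per workgroup instead of one per segment.
    carry = EMPTY
    for tail, first in zip(tails, firsts):
        if first is None:
            carry = union(carry, tail)   # path spans this whole workgroup
        else:
            out[first] = union(carry, out[first])
            carry = tail
    return out


segs = [(0, 0, 1, 1), (1, 1, 3, 2), (5, 5, 6, 6), (4, 7, 5, 8), (2, 2, 2, 9)]
ends = [False, True, False, False, True]
# The result is the same for any workgroup size.
results = [path_bboxes_workgroups(segs, ends, wg=w) for w in (1, 2, 3, 256)]
```

With a prefix sum over addition, the carry from earlier workgroups would have to be added to every element; here it only affects values up to the first path end, and after compaction only that path end's value is kept.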
