
Edit Mesh: Parallelize bounding box calculation in edit mode
Abandoned · Public

Authored by Hans Goudey (HooglyBoogly) on May 12 2022, 5:49 PM.

Details

Summary

Parallelize the bounding box calculation for edit meshes.
In my simple test with the transform tools on a Ryzen 3800x
with a 1.5 million vertex cube, the bounding box calculation
became about 3x faster, but the overall improvement is only
a few percent.

This commit is similar to rB0ba061c3bc9b, rB6a71b2af66cf,
and rB6d7dbdbb44f3.

          Before    After
Average   9.7 ms    3.4 ms
Min       9.1 ms    2.8 ms
FPS       5.94      6.1

Diff Detail

Repository
rB Blender

Event Timeline

Hans Goudey (HooglyBoogly) requested review of this revision. May 12 2022, 5:49 PM
Hans Goudey (HooglyBoogly) created this revision.

Did you check the worst-case performance (if the array of vertices doesn't exist and needs to be created each time - I'd assume this would be slower)?

It would be good to compare this with BLI_task_parallel_mempool (even if not as fast, it doesn't have the down side of requiring the array).

I'm not so keen on having the bounding-box calculation _require_ an array of vertices to be created; perhaps the multi-threaded operation could be used only when the array already exists.

> Did you check the worst-case performance (if the array of vertices doesn't exist and needs to be created each time - I'd assume this would be slower)?

I checked with the bevel tool, since that seems to make the index array dirty.

          Before    After
Average   2.9 ms    2.3 ms
Min       2.7 ms    2.2 ms

So it's actually faster, even if the index array has to be recomputed.

> I'm not so keen on having the bounding-box calculation _require_ an array of vertices to be created; perhaps the multi-threaded operation could be used only when the array already exists.

On the contrary, I see this as a strength of this approach. Using indices is the simplest way to write a parallel algorithm. I think Edit mesh / BMesh code should use that approach more.
The complexity of using BLI_task_parallel_mempool for a reduction like this means that, honestly, I'm not so interested in working on that approach.
Plus, the cost is amortized; if the index array is generated here, it can be used for drawing code, conversion to mesh for modifiers, or anything else (maybe including simplifying attribute retrieval from BMesh).
The memory cost of an array of indices is quite small compared to other things.
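(For illustration, the index-table reduction argued for above can be sketched outside Blender like this. `Vert`, `Bounds`, and `bounds_parallel` are hypothetical stand-ins rather than Blender API, and `std::async` stands in for the task scheduler; the sketch assumes the vertex-pointer table has already been built.)

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cfloat>
#include <future>
#include <vector>

/* Hypothetical stand-in for BMVert: only the coordinate matters here. */
struct Vert {
  std::array<float, 3> co;
};

/* Running min/max bounds, merged pairwise for the parallel reduction. */
struct Bounds {
  std::array<float, 3> min{FLT_MAX, FLT_MAX, FLT_MAX};
  std::array<float, 3> max{-FLT_MAX, -FLT_MAX, -FLT_MAX};
  void add(const std::array<float, 3> &co)
  {
    for (int i = 0; i < 3; i++) {
      min[i] = std::min(min[i], co[i]);
      max[i] = std::max(max[i], co[i]);
    }
  }
  void merge(const Bounds &other)
  {
    for (int i = 0; i < 3; i++) {
      min[i] = std::min(min[i], other.min[i]);
      max[i] = std::max(max[i], other.max[i]);
    }
  }
};

/* Chunked parallel reduction over a flat table of vertex pointers, in the
 * spirit of BLI_task_parallel_range with a reduce callback: each task
 * reduces its own index range, then the partial bounds are merged. */
Bounds bounds_parallel(const std::vector<const Vert *> &vert_table)
{
  const size_t grain = 4096;
  std::vector<std::future<Bounds>> tasks;
  for (size_t start = 0; start < vert_table.size(); start += grain) {
    const size_t end = std::min(start + grain, vert_table.size());
    tasks.push_back(std::async(std::launch::async, [&vert_table, start, end] {
      Bounds local;
      for (size_t i = start; i < end; i++) {
        local.add(vert_table[i]->co);
      }
      return local;
    }));
  }
  Bounds result;
  for (auto &task : tasks) {
    result.merge(task.get());
  }
  return result;
}
```

The flat array makes the work-splitting trivial: any index range is a valid task, which is exactly the simplicity being argued for.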

In view of the many uses of BLI_task_parallel_mempool in the BMesh code, and the possibility of optimizing with the already-allocated BMVert array, perhaps an ideal solution would be a utility (in something like "tools/bmesh_threading.hh") that performs a parallel algorithm simply and efficiently, taking into account whether that array already exists.

Hmm, honestly I think the focus on avoiding the array of pointers is premature and unnecessary.
The performance requirements of a single simple loop over vertices after a topology change is quite small.
The memory cost of 8 bytes per vertex obviously isn't great, but I sort of doubt it's significant here.

I'd rather not apply this patch; while I can see arguments in its favor, there are some downsides too.

The main downside is that it moves BMesh in the direction of favoring static geometry, adding a potential penalty for performing destructive operations.
In some cases we can't avoid it - which is why the ability to create these tables exists - but I'd rather not make this the common code path unless it's needed, because it increases the chance that destructive operations will be hampered by causing lookup tables to be regenerated.

From my own tests, the speedup from this patch isn't so large if the lookup table needs to be generated too, and it can be slower on systems with fewer CPU cores available - although that's not such a fair comparison in the case of bounding-box calculation, as in many cases the table will be valid.

Testing an alternative (which I'll submit as a separate patch) gives similar gains to this patch without having to allocate tables.

Avoiding allocating tables has the advantage that threading for lightweight operations like calculating bounds can be used in more places without having to take into account the possibility of re-generating lookup tables.


In my tests it's possible to get roughly half of the speedup from this patch by using BM_iter_parallel (see P3088). Iterating over elements in the mempool directly (without function calls) gets similar gains to this patch, although this has some pros and cons that need to be considered too.
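(For illustration, the table-free alternative can be sketched with a toy fixed-size-chunk pool standing in for BLI_mempool; `Vert`, `Chunk`, `Bounds`, and `bounds_over_chunks` are hypothetical names, not Blender API. The point is that each mempool chunk is an independent contiguous range, so threads can reduce over chunks without any index table existing.)

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cfloat>
#include <future>
#include <vector>

/* Toy stand-ins: a mempool stores elements in fixed-size chunks, and each
 * chunk knows how many live elements it holds. */
struct Vert {
  std::array<float, 3> co;
};

struct Chunk {
  std::array<Vert, 512> elems;
  int count = 0; /* Number of live elements in this chunk. */
};

struct Bounds {
  std::array<float, 3> min{FLT_MAX, FLT_MAX, FLT_MAX};
  std::array<float, 3> max{-FLT_MAX, -FLT_MAX, -FLT_MAX};
  void add(const std::array<float, 3> &co)
  {
    for (int i = 0; i < 3; i++) {
      min[i] = std::min(min[i], co[i]);
      max[i] = std::max(max[i], co[i]);
    }
  }
  void merge(const Bounds &other)
  {
    for (int i = 0; i < 3; i++) {
      min[i] = std::min(min[i], other.min[i]);
      max[i] = std::max(max[i], other.max[i]);
    }
  }
};

/* One task per chunk: no index table is needed, at the cost of walking the
 * pool's chunk list instead of splitting a flat array arbitrarily. */
Bounds bounds_over_chunks(const std::vector<Chunk> &pool)
{
  std::vector<std::future<Bounds>> tasks;
  for (const Chunk &chunk : pool) {
    tasks.push_back(std::async(std::launch::async, [&chunk] {
      Bounds local;
      for (int i = 0; i < chunk.count; i++) {
        local.add(chunk.elems[i].co);
      }
      return local;
    }));
  }
  Bounds result;
  for (auto &task : tasks) {
    result.merge(task.get());
  }
  return result;
}
```

The trade-off is visible in the structure: work granularity is fixed by the chunk size rather than chosen freely, which is one of the pros/cons mentioned above.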

> > Did you check the worst-case performance (if the array of vertices doesn't exist and needs to be created each time - I'd assume this would be slower)?
>
> I checked with the bevel tool, since that seems to make the index array dirty.
>
>           Before    After
> Average   2.9 ms    2.3 ms
> Min       2.7 ms    2.2 ms
>
> So it's actually faster, even if the index array has to be recomputed.

This depends a lot on the number of cores available, in my tests with 4-8 cores the speed is about the same.

Sorry to continue being contrary, but personally I still find paying for the table once in order to simplify all code afterwards to be the better approach.
The specific bounds-calculation operation is faster if we can avoid building the table, but the cost of building it is amortized, since every parallel iteration in the future (drawing code, transform code, etc.) can use that table later on.
For example, the drawing code (mesh_render_data_create) already calls BM_mesh_elem_index_ensure, so it has to be done at some point anyway.
The simplicity of using the table is helpful too; I'd argue code like const BMVert &vert = *BM_vert_at_index(bm, i); in a loop is much easier to understand.
Knowing the indices for BMesh elements makes it much simpler to interface with elsewhere in Blender, where using arrays of indices and simple parallel loops has proven to be a useful optimization technique.

For performance, topology-changing operations tend to be much more expensive anyway, meaning building of the table will be relatively less of a bottleneck. Leaving performance on the table for every other case when the table isn't dirty seems like a loss to me.
I guess we're heading in a fairly theoretical direction. Your comment about not favoring static geometry for BMesh does make some sense.

I see what you're saying, and in some cases, when the tables are ensured to be created for drawing anyway, this makes sense.

My concern remains that this adds a constraint where we need to predict how functions are called, to make sure we don't run into situations where the cost of generating lookup tables won't be amortized.
When this patch is compared with D15499, the speed difference is small enough that I don't think it's worth splitting hairs about when we should or shouldn't be generating lookup tables.
Instead, we can have fast iteration that doesn't depend on tables and use it in similar places (counting selected elements is another candidate where function-call overhead is likely measurable).

If this patch gave a significant speedup, we could note in the code comments that this is an exception to the current rule of thumb and explain why it needs to be one, but that doesn't seem to be the case.

Okay, your points are fair. I'll close this patch. I guess in the future if it's really helpful we can specialize code for whether we want to use the tables or not. Maybe even keeping track of whether they will be used elsewhere anyway. But like you say, it doesn't seem to be worth it currently (disregarding the simplicity argument anyway).

I wonder if we could speed up the creation of the tables given some more information from the internals of the mempool.
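(For illustration, here is one way that could work, sketched with a toy fixed-size-chunk pool standing in for BLI_mempool; all names are hypothetical, not Blender API. Since the pool already knows how many live elements each chunk holds, an exclusive prefix sum over the chunk counts gives every chunk a disjoint slice of the pointer table, and the slices can then be filled in parallel with no synchronization.)

```cpp
#include <array>
#include <cassert>
#include <future>
#include <vector>

/* Toy stand-in for a mempool: elements live in fixed-size chunks, and the
 * pool knows how many live elements each chunk holds. */
struct Vert {
  std::array<float, 3> co;
};

struct Chunk {
  std::array<Vert, 512> elems;
  int count = 0;
};

/* Build the vertex-pointer table in parallel: an exclusive prefix sum over
 * the per-chunk counts gives every chunk a disjoint destination slice, so
 * the copies are independent. */
std::vector<const Vert *> build_table_parallel(const std::vector<Chunk> &pool)
{
  /* Exclusive prefix sum of per-chunk element counts. */
  std::vector<size_t> offsets(pool.size() + 1, 0);
  for (size_t c = 0; c < pool.size(); c++) {
    offsets[c + 1] = offsets[c] + size_t(pool[c].count);
  }
  std::vector<const Vert *> table(offsets.back());

  std::vector<std::future<void>> tasks;
  for (size_t c = 0; c < pool.size(); c++) {
    tasks.push_back(std::async(std::launch::async, [&pool, &table, &offsets, c] {
      const Chunk &chunk = pool[c];
      for (int i = 0; i < chunk.count; i++) {
        table[offsets[c] + size_t(i)] = &chunk.elems[i];
      }
    }));
  }
  for (auto &task : tasks) {
    task.wait();
  }
  return table;
}
```

The prefix sum itself is a cheap serial pass over the chunk list; the per-element pointer writes, which dominate, become embarrassingly parallel.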