Page MenuHome

Mesh: Parallelize bounding box calculation
ClosedPublic

Authored by Hans Goudey (HooglyBoogly) on Dec 14 2021, 3:20 PM.

Details

Summary

This replaces the single-threaded calculation of mesh min and max
positions with a parallel_reduce loop. Since the bounding box
of a mesh is retrieved quite often (at the end of each evaluation,
currently 2(?!) times when leaving edit mode, etc.), this makes for a
quite noticeable speedup actually.

On my Ryzen 3700x and a 4.2 million vertex mesh, I observed
a 4.4x performance increase, from 14 ms to 4.4 ms.

I added some methods to float3 so they would be inlined, but
they're also a nice addition, since they're used often anyway.


If this is accepted, I'll look into updating the bounding boxes for some
other object types, mesh needs a special implementation because
MVert != float3 for now though ;).

Diff Detail

Repository
rB Blender
Branch
mesh-parallelize-bounding-box (branched from master)
Build Status
Buildable 19581
Build 19581: arc lint + arc unit

Event Timeline

Hans Goudey (HooglyBoogly) requested review of this revision.Dec 14 2021, 3:20 PM
Hans Goudey (HooglyBoogly) created this revision.
Hans Goudey (HooglyBoogly) planned changes to this revision.Dec 16 2021, 8:09 PM

@Ray Molenkamp (LazyDodo) suggests using minmax_v3v3_v3_array, which seems to inline minmax_v3v3_v3 on each element while this doesn't.

i tested a little minmax_v3v3_v3_array didn't work since the data isn't in a continuous chunk of ram (there's some other fields in Mvert) getting minmax_v3v3_v3 to inline gave a respectable perf boost though.

base       BKE_mesh_minmax 1572866 verts in 9.709300 ms
inline     BKE_mesh_minmax 1572866 verts in 4.150400 ms
D13572     BKE_mesh_minmax 1572866 verts in 2.078700 ms
D13572_inl BKE_mesh_minmax 1572866 verts in 1.847800 ms

inlining looks like it lost its effectiveness with D13572, but it appears we're running into some other limitation (best guess we're starting to be memory bound) running blender in single threaded mode (-t 1) shows it's definitely inlining and having a benefit.

D13572 -t 1    BKE_mesh_minmax 1572866 verts in 9.843400 ms
D13572_inl -t1 BKE_mesh_minmax 1572866 verts in 3.420200 ms
  • Merge branch 'master' into mesh-parallelize-bounding-box
  • Inline min and max functions, final cleanup
Hans Goudey (HooglyBoogly) retitled this revision from Mesh: Parallelize bounding box calculation (WIP) to Mesh: Parallelize bounding box calculation.Dec 22 2021, 12:54 AM
Hans Goudey (HooglyBoogly) edited the summary of this revision. (Show Details)
Hans Goudey (HooglyBoogly) edited the summary of this revision. (Show Details)
Jacques Lucke (JacquesLucke) accepted this revision.EditedDec 22 2021, 11:52 AM

Looks good. Note that there is a way to compute the min *and* max with fewer comparisons. No idea if this has any effect in runtime.

The basic idea to reduce comparisons is to (1) always process two elements together and (2) use the fact that we compute the minimum AND the maximum.

void compute_min_max_simple(float a, float b, float &min, float &max) {
  if (a < min) min = a;
  if (a > max) max = a;
  if (b < min) min = b;
  if (b > max) max = b;
}

void compute_min_max_new(float a, float b, float &min, float &max) {
  if (a < b) {
    if (a < min) min = a;
    if (b > max) max = b;
  }
  else {
    if (b < min) min = b;
    if (a > max) max = a;
  }
}

As you can see, in the new variant there are only three comparisons in every iteration, while in the first variant there are four.
No idea if this can actually improve performance. It could potentially make it much worse as well by reducing the effectiveness of cpu branch prediction. The conditions in the first algorithm are almost always false, making them easy to predict. In the second algorithm the first condition is much more random usually.
Could be interesting to try this at some point, just wanted to throw it out here :)


I noticed that in some test cases the execution time of the Bounding Box node is much slower than the actual call to BKE_mesh_minmax (400 ms in node, but only 60 ms computing the bounding box). In those cases the majority of the time is spend freeing the input geometry after the bounding box has been computed. Would be interesting to check if we could let another thread do the freeing in some cases.
An alternative would be to use get_input instead of extract_input when getting the geometry in this node. Then the node wouldn't be responsible for freeing the input geometry. That doesn't affect the actual run-time yet, but it could make the time reported to the user more useful.

source/blender/blenkernel/intern/mesh.cc
1594

blender:: is unnecessary.

This revision is now accepted and ready to land.Dec 22 2021, 11:52 AM

Interesting! I tried implementing that, and didn't see any speedup (actually a bit slower, 3.21 ms average vs 3.17ms).
Could be the way I implemented it, it's also a bit more complex though, so I think I'll stick with the current patch.

Agreed, it would be good to look into freeing on a separate thread.

Would be interesting to check if we could let another thread do the freeing in some cases.

Careful there, ideally you'd want the memory requirements for a specific graph to be deterministic, moving de-allocations to a background thread will end that determinism , if the buffers are large enough could lead to problems down the road that be "less fun' to track down.