
Sculpt: Optimize PBVH mesh normal updates
Needs Review · Public

Authored by Pablo Dobarro (pablodp606) on Nov 24 2020, 12:18 AM.

Details

Summary

The current method to update the normals does the following (sketched in code after the list):

  • Allocate an array the same size as mverts, regardless of which normals need to be updated
  • Loop over the faces to accumulate the normals using an atomic_add into the array
  • Loop again over vertices to normalize the accumulated normal and store it in the vertex
  • Free the array
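
For reference, a minimal C sketch of these four steps, using simplified stand-in types and helpers (Vert, Face, face_normal() and normalize_into() are illustrative, not Blender's actual PBVH data structures; in the real code the accumulation loop is threaded over PBVH nodes, which is why it needs atomic_add_and_fetch_fl()):

```c
#include <math.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the mesh data. */
typedef struct { float co[3], no[3]; } Vert;
typedef struct { int v[3]; } Face; /* triangles only, for brevity */

/* Unnormalized face normal from the cross product of two edges. */
static void face_normal(const Vert *verts, const Face *f, float r_no[3])
{
  const float *a = verts[f->v[0]].co;
  const float *b = verts[f->v[1]].co;
  const float *c = verts[f->v[2]].co;
  const float e1[3] = {b[0] - a[0], b[1] - a[1], b[2] - a[2]};
  const float e2[3] = {c[0] - a[0], c[1] - a[1], c[2] - a[2]};
  r_no[0] = e1[1] * e2[2] - e1[2] * e2[1];
  r_no[1] = e1[2] * e2[0] - e1[0] * e2[2];
  r_no[2] = e1[0] * e2[1] - e1[1] * e2[0];
}

static void normalize_into(float r[3], const float v[3])
{
  const float len = sqrtf(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
  if (len > 0.0f) {
    r[0] = v[0] / len;
    r[1] = v[1] / len;
    r[2] = v[2] / len;
  }
}

void update_normals_current(Vert *verts, int totvert, const Face *faces, int totface)
{
  /* 1) Temporary accumulator sized for *all* vertices, updated or not. */
  float (*vnors)[3] = calloc((size_t)totvert, sizeof(float[3]));

  /* 2) Accumulate face normals; in Blender this loop runs threaded, so each
   * += below is really an atomic_add_and_fetch_fl(). */
  for (int f = 0; f < totface; f++) {
    float fno[3];
    face_normal(verts, &faces[f], fno);
    for (int j = 0; j < 3; j++) {
      for (int k = 0; k < 3; k++) {
        vnors[faces[f].v[j]][k] += fno[k];
      }
    }
  }

  /* 3) Second loop to normalize and store, then 4) free the array. */
  for (int v = 0; v < totvert; v++) {
    normalize_into(verts[v].no, vnors[v]);
  }
  free(vnors);
}
```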

If the SculptSession already contains a pmap that maps vertices to polys, all
these steps can be skipped and the accumulation can be done in a single
loop over the affected vertices.
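
A sketch of that pmap-based single loop, reusing the stand-in types and helpers from the sketch above; pmap/pmap_start are a hypothetical flattened vertex-to-face adjacency map standing in for the SculptSession pmap:

```c
void update_normals_pmap(Vert *verts, int totvert, const Face *faces,
                         const int *pmap, const int *pmap_start /* totvert + 1 */)
{
  /* One loop over the affected vertices: no temporary array, no atomics.
   * The trade-off: each face normal is recomputed once per adjacent vertex. */
  for (int v = 0; v < totvert; v++) {
    float accum[3] = {0.0f, 0.0f, 0.0f};
    for (int i = pmap_start[v]; i < pmap_start[v + 1]; i++) {
      float fno[3];
      face_normal(verts, &faces[pmap[i]], fno);
      accum[0] += fno[0];
      accum[1] += fno[1];
      accum[2] += fno[2];
    }
    normalize_into(verts[v].no, accum);
  }
}
```

Note that this redundant per-vertex recomputation of face normals is the trade-off the discussion below revolves around.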

Using the scale mesh filter on a Threadripper 3990X to modify an entire
mesh at different vertex counts, I get the following results (times in seconds):

Vertices | Current  | pmap     | Improvement
311906   | 0.029462 | 0.001463 | 20.13x
1247818  | 0.050463 | 0.004267 | 11.82x
4990466  | 0.073909 | 0.014369 | 5.14x

I added an alternative implementation with two loops to calculate and store the
normals, avoiding redundant calculations. These are the results:

Threadripper 3990X

Vertices | Current  | New single loop | New calc + store
300K     | 0.029462 | 0.001463        | 0.001511
1M       | 0.050463 | 0.004267        | 0.005809
4M       | 0.073909 | 0.014369        | 0.021079

i7 9750H

Vertices | Current  | New single loop | New calc + store
300K     | 0.007742 | 0.004092        | 0.003225
1M       | 0.030112 | 0.015212        | 0.014419
4M       | 0.122249 | 0.057313        | 0.053424

Note: I left the timer code in, in case you want to check whether this
improvement also applies on other hardware. To test, run the mesh filter
right after entering sculpt mode to use the current update method, then
use the smooth brush to allocate the pmap and run the test again.

Diff Detail

Repository
rB Blender
Branch
sculpt-fast-normals (branched from master)
Build Status
Buildable 11401
Build 11401: arc lint + arc unit

Event Timeline

Pablo Dobarro (pablodp606) requested review of this revision. Nov 24 2020, 12:18 AM
Pablo Dobarro (pablodp606) created this revision.

From my understanding this patch is trading the overhead of atomic_add_and_fetch_fl for redundant face-normal calculation. The former is faster on machines without many cores, the latter is faster on machines with a lot of cores (such as the Threadripper). If there is no speed penalty on machines with a lower number of cores, this might be an acceptable trade-off.

Here is another idea to be tested before making a call here:

  1. Allocate a single array of num_faces float3 elements.
  2. Run a first threaded pass over the nodes, and calculate the face normals of the changed nodes.
  3. Run a second pass over the nodes, and do the same normal calculation as you do now, but use the pre-calculated poly normal from the array.

There is still an array allocation, but its size will be considerably lower than an array allocated for num_vertices.

In theory this should avoid any possible balance disruption between atomics vs. redundant calculation on lower-core-count machines, and might improve the Threadripper case further as well.
This approach still requires an "external" pmap, but maybe that's not too bad.
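
A sketch of what that two-pass variant could look like, reusing the stand-in types and helpers from the sketches in the summary (the node iteration is flattened into plain loops for brevity):

```c
void update_normals_two_pass(Vert *verts, int totvert,
                             const Face *faces, int totface,
                             const int *pmap, const int *pmap_start)
{
  /* Pass 1: pre-compute face normals. The array is num_faces elements,
   * considerably smaller than a num_vertices allocation; in Blender this
   * pass would run threaded over the changed PBVH nodes only. */
  float (*fnors)[3] = malloc(sizeof(float[3]) * (size_t)totface);
  for (int f = 0; f < totface; f++) {
    face_normal(verts, &faces[f], fnors[f]);
  }

  /* Pass 2: same per-vertex accumulation as the single-loop sketch, but the
   * pre-computed normals are reused, so there is neither atomic contention
   * nor redundant recomputation. */
  for (int v = 0; v < totvert; v++) {
    float accum[3] = {0.0f, 0.0f, 0.0f};
    for (int i = pmap_start[v]; i < pmap_start[v + 1]; i++) {
      const float *fno = fnors[pmap[i]];
      accum[0] += fno[0];
      accum[1] += fno[1];
      accum[2] += fno[2];
    }
    normalize_into(verts[v].no, accum);
  }
  free(fnors);
}
```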

  • Implement pmap updates with calculate + store loops
Pablo Dobarro (pablodp606) planned changes to this revision. Nov 24 2020, 5:51 PM

I updated the results; I would rather choose the single-loop approach:

  • The performance gain on the Threadripper is higher than the loss on the i7.
  • These tests are done with the mesh filter, which is the worst-case scenario where the entire mesh needs normal updates. When working with brushes with a small radius and low spacing (where the biggest performance issues are), the single-loop approach can skip normal calculations per vertex based on the ME_VERT_PBVH_UPDATE flag (sketched after this list), while the first loop of the two-loop version calculates the normals for the entire node.
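
A sketch of that per-vertex early-out: ME_VERT_PBVH_UPDATE is the real MVert flag (its bit value below is reproduced from memory, the actual definition lives in DNA_meshdata_types.h, so treat it as indicative), while MVertLike and the loop shape are simplified stand-ins:

```c
#define ME_VERT_PBVH_UPDATE (1 << 5) /* assumed value; see DNA_meshdata_types.h */

/* Stand-in for Blender's MVert. */
typedef struct { float no[3]; char flag; } MVertLike;

void update_tagged_normals(MVertLike *mvert, int totvert)
{
  for (int v = 0; v < totvert; v++) {
    if (!(mvert[v].flag & ME_VERT_PBVH_UPDATE)) {
      continue; /* normal still valid: skip all adjacent face-normal work */
    }
    /* ... accumulate adjacent face normals via the pmap and normalize,
     * as in the single-loop sketch in the summary ... */
    mvert[v].flag &= ~ME_VERT_PBVH_UPDATE;
  }
}
```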

Please always use clang-format when working on code.

> The performance gain on the Threadripper is higher than the loss on the i7.

I don't really understand why we should trade performance like this. It is more likely that Blender users and typical artists will have about 8-12 cores.
It should be possible to improve performance on both classes of hardware.

> The single-loop approach can skip normal calculations per vertex based on the ME_VERT_PBVH_UPDATE flag.

It is still possible in any approach by checking the flags of adjacent vertices.

What is unexpected to me is the amount of performance loss on the Threadripper with what you call the "2 loop approach". Is it because of computing normals of faces which are not adjacent to a modified vertex?
Or is the threading overhead inside of TBB really that noticeable?

Some ways to make numbers more reliable/trustworthy:

  • Set CPU governor to performance.
  • Use TIMEIT_START_AVERAGED / TIMEIT_END_AVERAGED (see the sketch after this list).
  • review update, add check for tagged vertices
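
For reference, a minimal sketch of wrapping the code under test with the averaged timing macros from PIL_time_utildefines.h (the function and the timer label are illustrative):

```c
#include "BKE_pbvh.h"
#include "PIL_time_utildefines.h"

void update_normals_timed(PBVH *pbvh)
{
  /* The *_AVERAGED variants accumulate over repeated invocations and report
   * the running average, which smooths out per-run noise. */
  TIMEIT_START_AVERAGED(pbvh_update_normals);
  /* ... normal-update code under test ... */
  TIMEIT_END_AVERAGED(pbvh_update_normals);
}
```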

I did the tests again with your suggestions and the results are similar; the Threadripper is always slower with the two-loop approach.
It now has a check to calculate only the necessary normals in the first loop, but this won't have any effect with the mesh filter because all normals need to be updated.

If you think that the two-loop version is safer, I can remove the one-loop version to finish the patch (either version is still about 10 times faster than the current implementation, so it should be fine).

I have always noticed that adding multithreaded loops over the nodes using TBB makes any sculpt function slower on both Intel and AMD, especially when dealing with a low number of nodes. The mesh filter is the worst-case scenario when measuring performance, but it is not the most common use case for sculpt mode. I would rather choose optimizations that prioritize modifying a small number of vertices as fast as possible over modifying the maximum number of vertices as fast as possible.

When the number of elements is small, especially when the complexity of handling an element is low, attempts to use multi-threading will indeed degrade performance. Threading always has some amount of overhead, so you should only use threading when that overhead is smaller than the performance gain.
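
A sketch of how this is commonly expressed with Blender's BLI_task API, gating use_threading on the workload size (the threshold of 16 is purely illustrative):

```c
#include "BLI_task.h"

static void node_task_cb(void *__restrict userdata, const int n,
                         const TaskParallelTLS *__restrict tls)
{
  /* ... per-node work ... */
}

void process_nodes(void *userdata, int totnode)
{
  TaskParallelSettings settings;
  BLI_parallel_range_settings_defaults(&settings);
  /* Only spawn threads when there is enough work to amortize the overhead. */
  settings.use_threading = (totnode > 16);
  BLI_task_parallel_range(0, totnode, userdata, node_task_cb, &settings);
}
```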

Being suboptimal for a less-used CPU configuration among the userbase, and more optimal for the more commonly used CPU configuration, seems more reasonable to me. Especially since the solution then also behaves optimally for arbitrary mesh topology (with a lot of n-gons, where redundant calculation of face normals is undesirable).

There might be some trick to make both configurations optimal, but I cannot immediately think of concrete ideas.

Adding Bastien, so that he can give suggestions here as well.

I think we could do way better with the original pbvh_faces_update_normals code, if this assumption is true:

A vertex may be used by several PBVH nodes, but in typical common cases, the vast majority of vertices will only be used by one node.

Then, you can have a vnor array per PBVH node (since not many vertices are shared, it would use roughly the same amount of memory as the current vnors in the PBVH), and in threads you just add computed fnors to the node-local vnors.
An intermediary mono-threaded task added to pbvh_update_normals_accum_task_cb could then move/add those per-node vnors into the global PBVH vnors.

In fact, we could even be smarter and keep info on which vertices are actually shared between more than one node. Then you can do regular addition directly in pbvh_update_normals_accum_task_cb for the majority of non-shared vertices, and deal only with the shared ones in the intermediary non-threaded task.
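
A sketch of the mono-threaded merge step of that idea; NodeSketch is an illustrative stand-in for PBVHNode, and the threaded per-node accumulation that fills local_vnors is omitted:

```c
/* Stand-in for a PBVH node, holding a node-local accumulation buffer
 * parallel to its vertex index list. */
typedef struct {
  int *vert_indices;       /* indices into the mesh vertex array */
  int totvert;             /* number of vertices in this node */
  float (*local_vnors)[3]; /* filled by the threaded per-node pass */
} NodeSketch;

/* Mono-threaded merge: add each node's local sums into the global vnors
 * array. The threaded pass wrote only node-local memory, so it needed no
 * atomics; vertices shared between nodes simply receive multiple
 * contributions here. */
void merge_node_vnors(float (*vnors)[3], const NodeSketch *nodes, int totnode)
{
  for (int n = 0; n < totnode; n++) {
    const NodeSketch *node = &nodes[n];
    for (int i = 0; i < node->totvert; i++) {
      const int v = node->vert_indices[i];
      vnors[v][0] += node->local_vnors[i][0];
      vnors[v][1] += node->local_vnors[i][1];
      vnors[v][2] += node->local_vnors[i][2];
    }
  }
}
```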

source/blender/blenkernel/intern/pbvh.c:1217

This optimization can be added to the orig code as well, I think (in pbvh_update_normals_accum_task_cb)?