
Mesh: optimize normal calculation
ClosedPublic

Authored by Campbell Barton (campbellbarton) on Jul 22 2021, 7:31 AM.

Details

Summary

Optimize mesh normal calculation.

  • Remove the intermediate lnors_weighted array, accumulate directly into the normal array using a spin-lock for thread safety.
  • Remove the single-threaded iteration over loops (normal calculation is now fully multi-threaded).
  • Remove stack array (alloca) for pre-calculating edge-directions.

Summary of Performance Characteristics:

  • The largest gains are for single high-poly meshes: isolated normal-calculation benchmarks show a 2x+ speedup for meshes over ~1.5 million polygons, and ~25 million polygons are ~2.85x faster.
  • Single lower-poly meshes (250k polys) can be ~2x slower.

    Since these meshes aren't normally a bottleneck, and the problem isn't noticeable in large scenes, we considered the performance trade-off reasonable.
  • The performance difference shrinks with larger scenes; tests with production files from "Sprite Fight" show the same or slightly better overall performance.
NOTE: tested on an AMD Ryzen TR 3970X 32-Core.

For more details & benchmarking scripts, see the patch description.


Posting this for review as it reverts rBd130c66db436b1fccbbde040839bc4cb5ddaacd2.

The only significant differences I can see between this patch and the code before rBd130c66db436b1fccbbde040839bc4cb5ddaacd2 was applied are:

  • Lock the entire vector before adding (instead of 3x atomic_add_and_fetch_fl calls per vertex); a sketch follows this list.
  • Vertex accumulation runs in parallel.
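As a rough illustration, here is a minimal sketch of the locked accumulation. The BLI_spin_* and vector-math calls are real Blender APIs, but the surrounding structure and names are hypothetical simplifications; a single lock for the whole array is shown only for brevity, and the lock granularity in the actual patch may differ.

```c
#include "BLI_math.h"
#include "BLI_threads.h"

typedef struct VertNormalAccum {
  float (*vnors)[3]; /* Shared per-vertex accumulation array. */
  SpinLock lock;     /* One lock for the whole array, for brevity only. */
} VertNormalAccum;

/* Accumulate one weighted polygon normal into a vertex normal: the whole
 * 3-float addition happens under a single lock, instead of three
 * atomic_add_and_fetch_fl calls (one per component). */
static void vert_normal_accumulate(VertNormalAccum *accum,
                                   const int v_index,
                                   const float pnor[3],
                                   const float weight)
{
  float vnor_add[3];
  mul_v3_v3fl(vnor_add, pnor, weight);

  BLI_spin_lock(&accum->lock);
  add_v3_v3(accum->vnors[v_index], vnor_add);
  BLI_spin_unlock(&accum->lock);
}
```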

Measurements for BKE_mesh_calc_normals_poly

(graph: calculation time vs. polygon count, not reproduced here)

Measurements for Time Between Updates (1/FPS)

This includes the time for a simple-deform modifier.


Measurements for Time Between Updates (1/FPS) 16x linked duplicates

To show how performance compares with multiple objects being calculated at once.

This includes the time for a simple-deform modifier which was used on each object.

Measurements for Time Between Updates (1/FPS) High Poly (1-10million)

This includes the time for a simple-deform modifier.


(P2291 used to generate test data, P2283 to benchmark the test data).

  • Tested using 32 cores (64 threads), average of 10 calls.
  • NOTE: above 1 million, steps of 100k were used instead of 10k.

Measurements with Production Files

Tested with sprite_fight r5056 (single maximized viewport, all overlays disabled except for text) using the best FPS after playing back for a few seconds.

File | Before | After
pro/shots/130_snacks/130_0060_A/130_0060_A.anim.blend | 6.96 fps | 7.05 fps
pro/shots/110_rextoria/110_0170_A/110_0170_A.v003polish.anim.blend | 6.15 fps | 16.25 fps
pro/shots/110_rextoria/110_0090_A/110_0090_A.anim.blend | 5.18 fps | 5.18 fps

Observations

  • In my tests the spin-lock was almost never waiting, on roughly 0.01% of additions.
  • The overhead of locking was negligible (replacing with add_v3_v3 didn't make a significant difference to performance).
  • Changing optimization flags didn't make a significant difference (-O2, -O3, -Ofast, both GCC and CLANG gave comparable results).
  • Manually limiting the threads to 4 shows much less variation in performance.
  • TaskParallelSettings.min_iter_per_thread is set to 1024, so tests on low-poly meshes will not saturate the CPU cores if (cores * 1024 > poly_count); in my case this makes tests on meshes with < 64k polys give *noisy* results (see the dispatch sketch after this list).
  • Comparing the "Before" times shows some noise in these tests; in practice the comparisons were reproducible, with slowdowns and speedups fairly close on each execution.
  • At around 650k polygons, this patch starts to perform better than master.
  • With multiple objects both the speedup for high-poly meshes & penalty for lower poly meshes reduce significantly.
  • Micro-optimizations in calculating normals make no significant difference; I tested the following changes:
    • Unrolling tri/quad face code-paths (removed alloca for edge-vectors).
    • Replace BLI_task with TBB.
    • Use madvise to hint to the system the kinds of memory access to expect.
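To make the min_iter_per_thread observation concrete, here is a minimal sketch of how the threshold is applied when dispatching the parallel loop. The TaskParallelSettings names come from BLI_task.h; the callback and data struct are hypothetical placeholders, not the patch's actual code.

```c
#include "BLI_task.h"

typedef struct NormalCalcData {
  /* Mesh arrays used by the callback would go here (hypothetical). */
  int unused;
} NormalCalcData;

static void calc_poly_normals_fn(void *__restrict userdata,
                                 const int poly_index,
                                 const TaskParallelTLS *__restrict tls)
{
  /* Per-polygon normal accumulation would go here. */
  (void)userdata;
  (void)poly_index;
  (void)tls;
}

static void calc_normals_dispatch(NormalCalcData *data, const int totpoly)
{
  TaskParallelSettings settings;
  BLI_parallel_range_settings_defaults(&settings);
  /* With min_iter_per_thread = 1024, a mesh only saturates N cores once it
   * has at least (N * 1024) polygons, hence the noisy results below 64k
   * polys on a 64-thread machine. */
  settings.min_iter_per_thread = 1024;
  BLI_task_parallel_range(0, totpoly, data, calc_poly_normals_fn, &settings);
}
```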

Further Work (out of scope for this patch)

Noting some additional areas of investigation that I consider out of scope for this patch.

  • While reducing TaskParallelSettings.min_iter_per_thread increases performance for low-poly meshes, it may cause problems in production files with many objects. I consider it out of scope for this patch since the gains are relatively minor in situations where performance typically isn't an issue; changing this value can be proposed & tested independently of this patch.
  • Using fast_acos with a 0.000067 error margin gives a small but measurable improvement (the exact speedup would need further testing since it's near the level of noise, approx 3-6%), see P2290 and the sketch after this list.
  • Using single threaded normal calculation in situations where many objects are being calculated at once (removing the need for locking in situations where scenes already distribute calculations across many objects).
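For reference, a possible fast_acos along these lines is sketched below: the classic Abramowitz & Stegun 4.4.45 polynomial approximation, whose maximum absolute error (~6.7e-5 rad) matches the margin quoted above. P2290's actual implementation may differ.

```c
#include <math.h>

/* Polynomial approximation of acosf(), max absolute error ~0.000067 rad.
 * A sketch only; not necessarily the implementation used in P2290. */
static float fast_acos(const float x)
{
  const float f = fabsf(x);
  /* Polynomial valid for 0 <= f <= 1. */
  const float r = sqrtf(1.0f - f) *
                  (1.5707288f +
                   f * (-0.2121144f + f * (0.0742610f + f * (-0.0187293f))));
  /* Mirror for negative inputs: acos(-x) == pi - acos(x). */
  return (x >= 0.0f) ? r : (float)M_PI - r;
}
```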

Alternative Optimizations

NOTE: None of the alternatives are especially promising, keeping for reference.

Since this can be a significant bottleneck, it's worth investigating alternative optimizations.

So far most alternatives perform worse; however, there are some minor improvements to the current method that could be extracted into a separate patch.

In-Line Vertex Normalize (P2284)

Track the number of un-handled loops per vertex using atomic int subtraction.

This method trades the separate multi-threaded vertex-normalization pass for a pass that counts the number of loops for each vertex.

The count can then be updated using atomic operations; once there are no loops left, the vector is normalized as part of accumulation (instead of in a separate loop).

This reduces lock contention:

  • The last addition before normalizing doesn't need to lock.
  • The additional CPU cycles for inline normalization could make it less likely that two threads access the same vector at the same time (a minimal sketch follows this list).
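A minimal sketch of the idea, with hypothetical names. For simplicity this version still locks every addition and only skips the separate normalization pass; the real P2284 also implements the unlocked final addition described above.

```c
#include "BLI_math.h"
#include "BLI_threads.h"
#include "atomic_ops.h"

/* Each vertex carries a counter initialized to its number of loops.
 * Decrementing *after* the locked addition means the thread that reaches
 * zero knows all contributions are in, so it can normalize in place
 * instead of waiting for a separate normalization pass. */
static void accumulate_and_maybe_normalize(float vnor[3],
                                           int32_t *loops_remaining,
                                           const float vnor_add[3],
                                           SpinLock *lock)
{
  BLI_spin_lock(lock);
  add_v3_v3(vnor, vnor_add);
  BLI_spin_unlock(lock);

  if (atomic_sub_and_fetch_int32(loops_remaining, 1) == 0) {
    /* All loops accumulated; no other thread touches this vertex again. */
    normalize_v3(vnor);
  }
}
```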

With 25-million faces, counting this takes roughly 10% of the overall time.

File | Before (s) | After (s) | Speedup
00064k.blend | 0.001962 | 0.005820 | 0.337056
00128k.blend | 0.003666 | 0.011120 | 0.329671
00256k.blend | 0.007327 | 0.016328 | 0.448751
00512k.blend | 0.014868 | 0.024971 | 0.595429
01024k.blend | 0.029769 | 0.037430 | 0.795347
02048k.blend | 0.109871 | 0.051841 | 2.119380
25000k.blend | 0.517715 | 0.207094 | 2.499900

Per-Thread Vertex Range (P2285)

This avoids locking: each thread loops over all polygons, but only operates on vertices that are in its own non-overlapping vertex index range, as sketched below.
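A minimal sketch, assuming pre-computed polygon normals and uniform weights (the real P2285 uses proper corner weighting); names are hypothetical, the MPoly/MLoop types are Blender's.

```c
#include "BLI_math.h"
#include "DNA_meshdata_types.h"

/* Every thread walks all polygons but only writes to vertices inside its
 * own disjoint [v_start, v_end) range, so no locking is required. */
static void accumulate_vertex_range(const MPoly *mpoly, const int totpoly,
                                    const MLoop *mloop,
                                    const float (*pnors)[3],
                                    float (*vnors)[3],
                                    const unsigned int v_start,
                                    const unsigned int v_end)
{
  for (int p = 0; p < totpoly; p++) {
    const MPoly *mp = &mpoly[p];
    for (int l = mp->loopstart; l < mp->loopstart + mp->totloop; l++) {
      const unsigned int v = mloop[l].v;
      if (v < v_start || v >= v_end) {
        continue; /* Not ours: a couple of integer compares, then move on. */
      }
      /* Weighted accumulation would go here; uniform weight for brevity. */
      add_v3_v3(vnors[v], pnors[p]);
    }
  }
}
```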

Comparison with master

File | Before (s) | After (s) | Speedup
00064k.blend | 0.002243 | 0.005680 | 0.394806
00128k.blend | 0.003724 | 0.012653 | 0.294345
00256k.blend | 0.007619 | 0.021723 | 0.350754
00512k.blend | 0.014994 | 0.033881 | 0.442554
01024k.blend | 0.031565 | 0.059081 | 0.534278
02048k.blend | 0.112171 | 0.108502 | 1.033816
25000k.blend | 0.521131 | 1.067226 | 0.488304

Optimizations to the Current Method

Store Loop Weights Instead of Vectors (P2286)

This is only a minor change to the current behavior.

Instead of storing the vectors for each loop, store their weight, which is then used with the polygon normal when accumulating into the vertex (see the sketch below).

On average it's slightly faster and uses less memory.
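A minimal sketch of the two passes, with hypothetical names (the math calls are Blender's):

```c
#include "BLI_math.h"

/* Prepare pass: store one float per loop (the corner angle) instead of a
 * full float[3] weighted vector, roughly a third of the memory. */
static void loop_weight_store(float *loop_weights, const int l,
                              const float co_prev[3],
                              const float co_curr[3],
                              const float co_next[3])
{
  loop_weights[l] = angle_v3v3v3(co_prev, co_curr, co_next);
}

/* Accumulate pass: rebuild the weighted contribution from the polygon
 * normal at accumulation time. */
static void loop_weight_accumulate(float vnor[3], const float pnor[3],
                                   const float loop_weight)
{
  madd_v3_v3fl(vnor, pnor, loop_weight);
}
```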

File | Before (s) | After (s) | Speedup
00064k.blend | 0.002182 | 0.002049 | 1.065078
00128k.blend | 0.003416 | 0.003851 | 0.887053
00256k.blend | 0.007799 | 0.006961 | 1.120338
00512k.blend | 0.015225 | 0.014962 | 1.017578
01024k.blend | 0.030878 | 0.029043 | 1.063176
02048k.blend | 0.112791 | 0.119770 | 0.941730
25000k.blend | 0.525442 | 0.509423 | 1.031446

Store Loop Weights Instead of Vectors (+ Cache Edge Vectors) (P2287)

A version of "Store Loop Weights Instead of Vectors" that also caches unit length edge vectors.

File | Before (s) | After (s) | Speedup
00064k.blend | 0.002170 | 0.002284 | 0.949978
00128k.blend | 0.003769 | 0.004815 | 0.782709
00256k.blend | 0.007604 | 0.008773 | 0.866766
00512k.blend | 0.015195 | 0.019465 | 0.780611
01024k.blend | 0.030726 | 0.040956 | 0.750213
02048k.blend | 0.111197 | 0.136106 | 0.816986
25000k.blend | 0.519313 | 0.639503 | 0.812057
Store Loop Weights Instead of Vectors (+ fast_acos & micro-optimizations) (P2290)

This patch uses minor optimizations to get the most out of the current method; this gives a ~13% overall speedup when averaging results, with some tests ~43% faster.

  • Use fast_acos (less precise with 0.000067 error margin).
  • Remove alloca, instead storing previous/next edge-vectors and re-using the last to avoid an extra call to normalize_v3 (sketched after this list).
  • Use one loop instead of two in mesh_calc_normals_poly_prepare_cb since the polygon normal isn't needed for calculating weights.
  • Set min_iter_per_thread to 64 instead of 1024.
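A hypothetical sketch of the edge-vector reuse: walking a polygon's corners, the outgoing edge vector of one corner is the incoming edge vector of the next, so each corner costs one normalize_v3 instead of two, and no alloca'd scratch array is needed. Names are placeholders; the math calls (including the safe saacos) are Blender's.

```c
#include "BLI_math.h"

/* Compute the corner angles of a polygon, reusing each normalized edge
 * vector for the following corner. */
static void poly_corner_angles(const float (*coords)[3], const int coords_num,
                               float *r_angles)
{
  float edvec_prev[3], edvec_next[3];

  /* Incoming edge of the first corner: last vertex -> first vertex. */
  sub_v3_v3v3(edvec_prev, coords[0], coords[coords_num - 1]);
  normalize_v3(edvec_prev);

  for (int i = 0; i < coords_num; i++) {
    const int i_next = (i + 1 != coords_num) ? i + 1 : 0;
    sub_v3_v3v3(edvec_next, coords[i_next], coords[i]);
    normalize_v3(edvec_next);

    /* Angle at corner i between the incoming and outgoing edges:
     * acos(dot(-d_prev, d_next)) == acos(-dot(d_prev, d_next)). */
    r_angles[i] = saacos(-dot_v3v3(edvec_prev, edvec_next));

    /* Reuse: this corner's outgoing edge is the next corner's incoming. */
    copy_v3_v3(edvec_prev, edvec_next);
  }
}
```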
File | Before (s) | After (s) | Speedup
regular_normals_00064k.blend | 0.001792 | 0.001621 | 1.105090
regular_normals_00128k.blend | 0.003182 | 0.003111 | 1.022553
regular_normals_00256k.blend | 0.006736 | 0.005774 | 1.166509
regular_normals_00512k.blend | 0.014230 | 0.013210 | 1.077208
regular_normals_01024k.blend | 0.038111 | 0.026630 | 1.431143
regular_normals_02048k.blend | 0.106946 | 0.106411 | 1.005024
regular_normals_25000k.blend | 0.517087 | 0.472683 | 1.093942

Thanks to @Jagannadhan Ravi (easythrees) for helping investigate this patch.

Diff Detail

Repository
rB Blender
Branch
master--normal-use-lock (branched from master)
Build Status
Buildable 16161
Build 16161: arc lint + arc unit

Event Timeline


In our lab, the performance improvements I see are pretty good. I tested with a relatively dense dragon model with an animated bend modifier applied.

I suggest putting this patch on hold, or alternatively only using this method when there are at least a certain number of loops per thread, since it does give a measurable speedup.

@Campbell Barton (campbellbarton), you mention that vertex accumulation now runs in parallel. Is this something that can be done in the current code to improve performance?

Not trivially, since we have an array of data to accumulate in destination indices.

Some options I've considered:

  • Having access to a vertex -> loop map (what D11906 does).

    Note that BMesh is already doing this since no data generation is needed.

    This is only worthwhile if the map can be reused many times, as generating it would likely be single-threaded (multi-threading it runs into a similar problem to the one we're running into here).
  • Accumulating into TLS arrays would work, but the memory required for each thread and the cost of merging the data back with many threads make it impractical.
  • It's also possible to accumulate vector additions into a granular data structure in TLS, bucketing to ensure thread safety when they're written back.

    Mentioning for completeness as the overhead makes it impractical AFAICS.
  • Locking the vector addition (what this patch does).

But we could also try an alternative solution: each task loops over the full set of faces with a predefined 'private' range of vertex indices, and only contributes to normal computation for loops that match 'their' vertices (essentially the Per-Thread Vertex Range approach listed above).

This would require always computing poly normals first in their own parallelized loop; then, in a second loop, we'd accumulate normals into vnors and normalize them in a single step.

I would expect looping over faces/loops that need no processing (since they're outside our own vertex index range) to be extremely fast (essentially a few integer increments and comparisons), compared to the actual weighted-normal computation. That way we avoid any need for locking/atomics/TLS. This solution should also be fully scalable regarding both mesh size and thread count. An added benefit is that since those tasks are much bigger (each loops over all polys, then over all of its own 'private' vnors for final normalization and conversion to short), the threading overhead would be much less noticeable.

@Campbell Barton (campbellbarton) what do you think of that idea? The only thing I am unsure about (without trying it for real) is the overhead of looping over all faces/loops, even if each task does nothing for most of them.

Another idea could be to generate virtual islands of the mesh (groups of faces/vertices fully isolated, with no possible concurrent access), and then process all the boundary data in a single task. But that would be fairly complex I reckon, and preparing those islands would likely add significant overhead too.

@Bastien Montagne (mont29) it could be worth testing.

Some issues could be:

  • It gets less efficient as more threads are added (with many simultaneous loops over all polygons, each loop has overhead; even though it's reasonably small, it could add up).
  • Some duplicate work calculating normals of multiple polygons (unless polygon normals are stored for reuse).
  • Some duplicate work calculating adjacent normalized edge vectors that currently get reused while walking around the polygon.
  • Efficiently calculating next/previous normalized edge vectors in cases where loops might not be used could get messy.
  • Some threads may take longer than others, where idle threads are left waiting for others to finish. This might be one of the biggest down-sides, as the size of meshes increases the chance that one thread takes longer than others increases too. Are there ways to avoid this problem?

@Bastien Montagne (mont29) @Campbell Barton (campbellbarton) do you have a pointer to a model and workflow where I can profile the "low" point here? Maybe I can help sussing out why 128K faces is a local maximum.

NOTE: while this update could theoretically give performance benefits, the overall result was worse, so rolled back this change, see: In-Line Vertex Normalize - alternative optimizations for reference.
Campbell Barton (campbellbarton) edited the summary of this revision.

Revert back to simple non vertex-loop counting logic as it performed better.

Campbell Barton (campbellbarton) edited the summary of this revision.

Last week I tried some alternative optimizations; these have been listed in the main task for completeness, since for the most part the results aren't promising.

One change tested resulted in a very minor speedup over the current method, by storing the loop weights instead of scaled polygon normals (listed in the main task as Store Loop Weights Instead of Vectors).

Since the difference is so small, I'd rather not make this change for now.

Before jumping to conclusions about the current timing, I'd like to check how other systems perform.

Campbell Barton (campbellbarton) edited the summary of this revision.

Remove alloca use when calculating edge-vectors (increases performance ~2-5%).

Graphs have been updated to include this change.

Campbell Barton (campbellbarton) edited the summary of this revision.

Am still fairly hesitant about adding back atomics here... But yes, performance as reported here seems fine; maybe CPU architectures have become much better at dealing with the extra cost of atomics? What exactly was the CPU used for those tests, btw?

In any case, I'd like to see one more test case checked for this patch: what happens with a lot of relatively small meshes (say, 1k objects of 1k vertices each)? As far as I understand you only tried with 16 objects, and it made both old and new code almost equally fast overall?

  • These tests ran on an AMD Ryzen Threadripper 3970X 32-Core Processor (using 64 threads).
  • When testing many 1k objects, each will only get one thread, as min_iter_per_thread is set to 1024. I think it's more reasonable to use production files that include many small objects as a test case, since a large number of exactly-1k-poly objects is an artificial case that could give unrealistically skewed results.

If the aim is to generate a mesh with the greatest lock contention, that'd be a single mesh with (number-of-system-threads * min_iter_per_thread) vertices; with many such objects, the lock contention will reduce since the threads will be working on different objects.

In the generated graphs that'd be a single 64k mesh (64 threads * 1024 = 65,536 vertices) having the most locking; if that were the case, a 400k mesh should have roughly 1/6th the lock contention (64k / 400k ≈ 1/6).

Further, swapping out the locking and non-locking versions of these functions hardly made a difference; although, if you're concerned about this, it might be worth generating a graph that doesn't use locking, to see if there is a situation that happens to be running into lock contention.

Another useful test would be to benchmark a complex production file, compare the frame-rate with/without locking.

Edit: tested some files from sprite-fight with/without locking and I couldn't notice any difference in the FPS.

Here is a quick test on two Sprite Fight production scenes, each time just load + 10 frames of updates, measuring BKE_mesh_normals_loop_split performance.

Ran on an AMD Ryzen 7 3800X 8-Core, 16 threads.

File | master rB6aae14027828f8 (s) | D11993 (s)
130_0220_A.lighting.blend (r5341) | 0.007397 (total: 3.565217, in 482 runs) | 0.005041 (total: 2.429869, in 482 runs)
030_0020_A.lighting.blend (r5341) | 0.001710 (total: 5.133853, in 3003 runs) | 0.001707 (total: 5.127394, in 3003 runs)

This is of course a fairly limited test, but this patch seems to be either neutral or beneficial (although normals evaluation as a whole represents less than 0.1% of frame evaluation time in those scenes anyway).

To be clear, I am not worried about potential locking (concurrency issues between threads) here. I am worried about the extra cost induced by an atomic operation, regardless of whether there are actually collisions or not.

Atomics used to induce a fairly high overhead (due to cache coherency constraints), that was strongly reducing the benefits of threading on light tasks as those normals computing, especially with very high number of threads. That's why they were removed in rBd130c66db436 four years ago.

From all that testing, it would appear that this might be much less of an issue with modern processors? Would be interesting to check on a recent Intel architecture though.

I tested this change with the dragon scene on the Xeon in our lab and saw an improvement of 8.7% in playback performance.

Thanks. At this point, given the amount of testing and the overall results, I would consider this patch a significant improvement over the existing code in practice.

Am still dubious about this low-level usage of atomics, but until proven otherwise it does not seem to be a problem, so I think it can go into master. Easy enough to tweak again in case wider testing reveals performance issues.

This revision is now accepted and ready to land. (Aug 12 2021, 3:06 PM)
Campbell Barton (campbellbarton) retitled this revision from "Mesh: optimize normal calculation (investigation)" to "Mesh: optimize normal calculation". (Aug 13 2021, 2:20 AM)
Campbell Barton (campbellbarton) edited the summary of this revision.