Optimize mesh normal calculation.
- Remove the intermediate lnors_weighted array, accumulate directly into the normal array using a spin-lock for thread safety.
- Remove single threaded iteration over loops (normal calculation is now fully multi-threaded).
- Remove stack array (alloca) for pre-calculating edge-directions.
Summary of Performance Characteristics:
- The largest gains are for single high poly meshes, with isolated normal-calculation benchmarks of meshes over ~1.5 million showing 2x+ speedup, ~25 million polygons are ~2.85x faster.
- Single lower poly meshes (250k polys) can be ~2x slower.
Since these meshes aren't normally a bottleneck, and this problem isn't noticeable on large scenes, we considered the performance trade-off reasonable.
- The performance difference reduces with larger scenes, tests with production files from "Sprite Fight" showing the same or slightly better overall performance.
For more details & benchmarking scripts, see the patch description.
Posting this for review as this reverts rBd130c66db436b1fccbbde040839bc4cb5ddaacd2,
The only significant things I can see that are different in this patch compared to the code before rBd130c66db436b1fccbbde040839bc4cb5ddaacd2 was applied are.
- Lock the entire vector before adding (instead of 3x atomic_add_and_fetch_fl calls per vertex).
- Vertex accumulation runs in parallel.
Measurements for BKE_mesh_calc_normals_poly
Measurements for Time Between Updates (1/FPS)
This includes the time for a simple-deform modifier.
Measurements for Time Between Updates (1/FPS) 16x linked duplicates
To show how performance compares with multiple objects being calculated at once.
This includes the time for a simple-deform modifier which was used on each object.
Measurements for Time Between Updates (1/FPS) High Poly (1-10million)
This includes the time for a simple-deform modifier.
(P2291 used to generate test data, P2283 to benchmark the test data).
- Tested using 32 cores (64 threads), average of 10 calls.
- NOTE: above 1 million, steps of 100k were used instead of 10k.
Measurements with Production Files
Tested with sprite_fight r5056 (single maximized viewport, all overlays disabled except for text) using the best FPS after playing back for a few seconds.
| File | Before | After |
| pro/shots/130_snacks/130_0060_A/130_0060_A.anim.blend | 6.96 fps | 7.05 fps |
| pro/shots/110_rextoria/110_0170_A/110_0170_A.v003polish.anim.blend | 6.15 fps | 16.25 fps |
| pro/shots/110_rextoria/110_0090_A/110_0090_A.anim.blend | 5.18 fps | 5.18 fps |
Observations
- In my tests the spin-lock was almost never waiting, roughly ~0.01% of additions.
- The overhead of locking was negligible (replacing with add_v3_v3 didn't make a significant difference to performance).
- Changing optimization flags didn't make a significant difference (-O2, -O3, -Ofast, both GCC and CLANG gave comparable results).
- Manually limiting the threads to 4 shows much less variation in performance.
- TaskParallelSettings.min_iter_per_thread is set to 1024, so tests on low-poly meshes will not saturate the CPU cores if (cores * 1024 > poly_count), in my case this makes tests with meshes with < 64k polys give *noisy* results.
- Comparing the "Before" times, there is some noise in these tests, in practice the time comparisons were reproducible, where slowdowns and speedups were fairly close on each execution.
- At around 650k polygons, this patch starts to perform better than master.
- With multiple objects both the speedup for high-poly meshes & penalty for lower poly meshes reduce significantly.
- Micro-optimizations in calculating normals make no significant difference, tested the following changes:
- Unrolling tri/quad face code-paths (removed alloca for edge-vectors).
- Replace BLI_task with TBB.
- Use madvise to help the system hint at the kinds of memory access to expect.
Further Work (out of scope for this patch)
Noting some additional areas of investigation that I consider out of scope for this patch.
- While reducing the TaskParallelSettings.min_iter_per_thread increases performance for low-poly meshes it may cause problems in production files with many objects. I consider it out of scope for this patch since the gains are relatively minor in situations where performance typically isn't an issue, changing this value can be proposed tested & independently of this patch.
- Using fast_acos with a 0.000067 error margin gives a small but measurable improvement (exact speedup would need further testing since it's near level of noise - approx 3-6%) see P2290.
- Using single threaded normal calculation in situations where many objects are being calculated at once (removing the need for locking in situations where scenes already distribute calculations across many objects).
Alternative Optimizations
Since this can be a significant bottleneck, it's worth investigating alternative optimizations.
So far most alternatives perform worse however there are some minor improvements to the current method that could be extracted into a separate patch.
In-Line Vertex Normalize (P2284)
Track the number of un-handled loops per vertex using atomic int subtraction.
This method trades a separate vertex normalization multi-threaded pass for a pass that counts the number or loops for each vertex
This can be updated using atomic operations, once there are no loops left the vector is normalized as part of accumulating (instead of a separate loop).
This reduces lock contention:
- The last addition before normalizing doesn't need to lock.
- The additional CPU cycles for inline normalization could make it less likely two threads access vectors at the same time.
With 25-million faces, counting this takes roughly 10% of the overall time.
| File | Before | After | Speedup |
| 00064k.blend | 0.001962 | 0.005820 | 0.337056 |
| 00128k.blend | 0.003666 | 0.011120 | 0.329671 |
| 00256k.blend | 0.007327 | 0.016328 | 0.448751 |
| 00512k.blend | 0.014868 | 0.024971 | 0.595429 |
| 01024k.blend | 0.029769 | 0.037430 | 0.795347 |
| 02048k.blend | 0.109871 | 0.051841 | 2.119380 |
| 25000k.blend | 0.517715 | 0.207094 | 2.499900 |
Per-Thread Vertex Range (P2285)
This avoids locking by each thread looping over all polygons, only operating on vertices that are in their non-overlapping vertex index range:
Comparison with master
| File | Before | After | Speedup |
| 00064k.blend | 0.002243 | 0.005680 | 0.394806 |
| 00128k.blend | 0.003724 | 0.012653 | 0.294345 |
| 00256k.blend | 0.007619 | 0.021723 | 0.350754 |
| 00512k.blend | 0.014994 | 0.033881 | 0.442554 |
| 01024k.blend | 0.031565 | 0.059081 | 0.534278 |
| 02048k.blend | 0.112171 | 0.108502 | 1.033816 |
| 25000k.blend | 0.521131 | 1.067226 | 0.488304 |
Optimizations to the Current Method
Store Loop Weights Instead of Vectors (P2286)
This is only a minor change to the current behavior
Instead of storing the vectors for each loop, store their weight,
which is then used with the polygon normal when accumulating into the vertex.
On average it's slightly faster and uses less memory.
| File | Before | After | Speedup |
| 00064k.blend | 0.002182 | 0.002049 | 1.065078 |
| 00128k.blend | 0.003416 | 0.003851 | 0.887053 |
| 00256k.blend | 0.007799 | 0.006961 | 1.120338 |
| 00512k.blend | 0.015225 | 0.014962 | 1.017578 |
| 01024k.blend | 0.030878 | 0.029043 | 1.063176 |
| 02048k.blend | 0.112791 | 0.119770 | 0.941730 |
| 25000k.blend | 0.525442 | 0.509423 | 1.031446 |
Store Loop Weights Instead of Vectors (+ Cache Edge Vectors) (P2287)
A version of "Store Loop Weights Instead of Vectors" that also caches unit length edge vectors.
| File | Before | After | Speedup |
| 00064k.blend | 0.002170 | 0.002284 | 0.949978 |
| 00128k.blend | 0.003769 | 0.004815 | 0.782709 |
| 00256k.blend | 0.007604 | 0.008773 | 0.866766 |
| 00512k.blend | 0.015195 | 0.019465 | 0.780611 |
| 01024k.blend | 0.030726 | 0.040956 | 0.750213 |
| 02048k.blend | 0.111197 | 0.136106 | 0.816986 |
| 25000k.blend | 0.519313 | 0.639503 | 0.812057 |
Store Loop Weights Instead of Vectors (+ fast_acos & micro-optimizations) (P2290)
This patch uses minor optimizations to get the most out of the current method, this gives a ~13% overall speedup when averaging results, with some tests ~43% faster.
- Use fast_acos (less precise with 0.000067 error margin).
- Remove alloca, instead store previous/next edge-vectors, re-using the last to avoid an extra call to normalize_v3.
- Use one loop instead of two in mesh_calc_normals_poly_prepare_cb since the polygon normal isn't needed for calculating weights.
- Set min_iter_per_thread to 64 instead of 1024.
| File | Before | After | Speedup |
| regular_normals_00064k.blend | 0.001792 | 0.001621 | 1.105090 |
| regular_normals_00128k.blend | 0.003182 | 0.003111 | 1.022553 |
| regular_normals_00256k.blend | 0.006736 | 0.005774 | 1.166509 |
| regular_normals_00512k.blend | 0.014230 | 0.013210 | 1.077208 |
| regular_normals_01024k.blend | 0.038111 | 0.026630 | 1.431143 |
| regular_normals_02048k.blend | 0.106946 | 0.106411 | 1.005024 |
| regular_normals_25000k.blend | 0.517087 | 0.472683 | 1.093942 |
Thanks to @Jagannadhan Ravi (easythrees) for helping investigate this patch.




