Optimize mesh normal calculation.
- Remove the intermediate `lnors_weighted` array, accumulate directly
  into the normal array using a spin-lock for thread safety.
- Remove the single-threaded iteration over loops
  (normal calculation is now fully multi-threaded).
- Remove the stack array (alloca) for pre-calculating edge-directions.
- Remove the stack array (alloca) for pre-calculating next/previous directions.
- Final normalization is now done in-line, using an array of remaining
  loops to detect the last time each vertex is accessed.
Summary of Performance Characteristics:
- The largest gains are for single high-poly meshes, with isolated
  normal-calculation benchmarks of meshes over ~1.5 million polygons
  showing a 2x+ speedup; ~25 million polygons are ~2.85x faster.
- Single lower-poly meshes (250k polys) can be ~2x slower.
  Since these meshes aren't normally a bottleneck
  and this problem isn't noticeable on large scenes,
  we considered the performance trade-off reasonable.
- The performance difference depends a lot on the use case and the threads available.
- The performance difference reduces with larger scenes,
  tests with production files from "Sprite Fight" showing
  the same or slightly better overall performance.
NOTE: tested on an AMD Ryzen TR 3970X 32-Core.
For more details & benchmarking scripts, see the patch description.
---
Posting this for review as it reverts rBd130c66db436b1fccbbde040839bc4cb5ddaacd2.
The only significant differences I can see in this patch, compared to the code before rBd130c66db436b1fccbbde040839bc4cb5ddaacd2 was applied, are:
- Lock the entire vector before adding (instead of 3x `atomic_add_and_fetch_fl` calls per vertex).
- Vertex accumulation runs in parallel.
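To make the locking strategy concrete, here is a minimal standalone C sketch (hypothetical names, not the actual patch code): it uses a C11 `atomic_flag` as a spin-lock rather than Blender's own lock type, and only illustrates locking the whole 3-component vector once per accumulation instead of issuing three per-component atomic adds.

```
/* Illustrative sketch only (hypothetical names, standalone C11):
 * lock the whole vector once per accumulation instead of doing
 * three per-component atomic float adds. */
#include <stdatomic.h>

static atomic_flag vert_normal_lock = ATOMIC_FLAG_INIT;

static void accumulate_locked(float r_vert_normal[3],
                              const float poly_normal[3],
                              const float weight)
{
  /* Spin until the lock is acquired; as noted in the observations below,
   * the lock was contended on only ~0.01% of additions. */
  while (atomic_flag_test_and_set_explicit(&vert_normal_lock, memory_order_acquire)) {
    /* busy-wait */
  }
  r_vert_normal[0] += poly_normal[0] * weight;
  r_vert_normal[1] += poly_normal[1] * weight;
  r_vert_normal[2] += poly_normal[2] * weight;
  atomic_flag_clear_explicit(&vert_normal_lock, memory_order_release);
}
```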
---
### Measurements for `BKE_mesh_calc_normals_poly`
{F10258734, size=full}
---
### Measurements for Time Between Updates (`1/FPS`)
This includes the time for a simple-deform modifier.
{F10258791, size=full}
---
### Measurements for Time Between Updates (`1/FPS`) 16x linked duplicates
To show how performance compares with multiple objects being calculated at once.
This includes the time for a simple-deform modifier which was used on each object.
{F10258674, size=full}
---
### Measurements for Time Between Updates (`1/FPS`) High Poly (1-10million)
This includes the time for a simple-deform modifier.
{F10258696, size=full}
---
(P2291 was used to generate the test data, P2283 to benchmark it).
- Tested using 32 cores (64 threads), average of 10 calls.
- NOTE: above 1 million, steps of 100k were used instead of 10k.
### Measurements with Production Files
Tested with `sprite_fight` r5056 (single maximized viewport, all overlays disabled except for text) using the best FPS after playing back for a few seconds.
| File | Before | After |
| `pro/shots/130_snacks/130_0060_A/130_0060_A.anim.blend` | 6.96 fps | 7.05 fps |
| `pro/shots/110_rextoria/110_0170_A/110_0170_A.v003polish.anim.blend` | 6.15 fps | 16.25 fps |
| `pro/shots/110_rextoria/110_0090_A/110_0090_A.anim.blend` | 5.18 fps | 5.18 fps |
---
### Observations
- In my tests the spin-lock almost never had to wait, on roughly ~0.01% of additions.
- The overhead of locking was negligible (replacing with `add_v3_v3` didn't make a significant difference to performance).
- Changing optimization flags didn't make a significant difference (`-O2`, `-O3`, `-Ofast`, both GCC and CLANG gave comparable results).
- Manually limiting the threads to 4 shows much less variation in performance.
- `TaskParallelSettings.min_iter_per_thread` is set to `1024`, so tests on low-poly meshes will not saturate the CPU cores when `(cores * 1024 > poly_count)`; in my case this makes tests on meshes with < 64k polys give *noisy* results.
- Comparing the "Before" times, there is some noise in these tests, in practice the time comparisons were reproducible, where slowdowns and speedups were fairly close on each execution.
- At around 650k polygons, this patch starts to perform better than master.
- With multiple objects both the speedup for high-poly meshes & penalty for lower poly meshes reduce significantly.
- Micro-optimizations in calculating normals make no significant difference, tested the following changes:
- Unrolling tri/quad face code-paths (removed `alloca` for edge-vectors).
- Replace `BLI_task` with `TBB`.
- Use `madvise` to hint to the system what kind of memory access to expect.
---
### Further Work (out of scope for this patch)
Noting some additional areas of investigation that I consider out of scope for this patch.
- While reducing `TaskParallelSettings.min_iter_per_thread` increases performance for low-poly meshes, it may cause problems in production files with many objects. I consider it out of scope for this patch since the gains are relatively minor in situations where performance typically isn't an issue; changing this value can be proposed & tested independently of this patch.
- Using `fast_acos` //with a `0.000067` error margin// gives a small but measurable improvement (the exact speedup would need further testing since it's near the level of noise, approx. 3-6%), see P2290.
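For illustration only, here is a classic polynomial approximation with a maximum absolute error of about `0.000067` (Abramowitz & Stegun 4.4.45); this is an assumption about what such a `fast_acos` could look like, not necessarily the implementation used in P2290.

```
/* Hypothetical fast acos() sketch: max absolute error ~6.7e-5
 * (Abramowitz & Stegun 4.4.45), not necessarily what P2290 uses. */
#include <math.h>

static float fast_acos_example(float x)
{
  const float xa = fabsf(x);
  /* Polynomial approximation, valid for xa in [0, 1]. */
  const float r = sqrtf(1.0f - xa) *
                  (1.5707288f + xa * (-0.2121144f + xa * (0.0742610f - 0.0187293f * xa)));
  /* Mirror for negative inputs: acos(x) = pi - acos(-x). */
  return (x < 0.0f) ? (3.14159265358979f - r) : r;
}
```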
### Alternative Optimizations
NOTE: none of the alternatives are especially promising; they are kept here for reference.
Since this can be a significant bottleneck, it's worth investigating alternative optimizations.
So far most alternatives perform worse; however, there are some minor improvements to the current method that could be extracted into a separate patch.
#### In-Line Vertex Normalize (P2284)
Track the number of un-handled loops per vertex using atomic integer subtraction.
This method trades the separate multi-threaded vertex-normalization pass for a pass that counts the number of loops for each vertex.
The count is updated using atomic operations; once there are no loops left, the vector is normalized as part of accumulation (instead of in a separate loop).
This reduces lock contention:
- The last addition before normalizing doesn't need to lock.
- The additional CPU cycles spent on inline normalization could make it less likely that two threads access the same vector at the same time.
With 25 million faces, this counting pass takes roughly 10% of the overall time (a minimal sketch of the counting logic follows the table below).
| File | Before | After | Speedup |
| 00064k.blend | 0.001962 | 0.005820 | 0.337056 |
| 00128k.blend | 0.003666 | 0.011120 | 0.329671 |
| 00256k.blend | 0.007327 | 0.016328 | 0.448751 |
| 00512k.blend | 0.014868 | 0.024971 | 0.595429 |
| 01024k.blend | 0.029769 | 0.037430 | 0.795347 |
| 02048k.blend | 0.109871 | 0.051841 | 2.119380 |
| 25000k.blend | 0.517715 | 0.207094 | 2.499900 |
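A minimal sketch of the counting idea described above (standalone C11, hypothetical names, not the actual contents of P2284): each vertex carries an atomic count of the loops that still have to accumulate into it, and whichever thread handles the last one normalizes the vector in place.

```
/* Sketch: inline normalization driven by a per-vertex atomic loop counter.
 * The accumulation itself would still need the spin-lock (or atomics);
 * that part is omitted to keep the counting logic visible. */
#include <math.h>
#include <stdatomic.h>

typedef struct VertAccum {
  float no[3];                /* accumulated, then normalized in place */
  atomic_int loops_remaining; /* pre-counted loops that use this vertex */
} VertAccum;

static void accumulate_and_maybe_normalize(VertAccum *va,
                                           const float poly_normal[3],
                                           const float weight)
{
  va->no[0] += poly_normal[0] * weight;
  va->no[1] += poly_normal[1] * weight;
  va->no[2] += poly_normal[2] * weight;

  /* atomic_fetch_sub returns the previous value, so 1 means this was the
   * last loop touching the vertex: normalize now instead of in a later pass. */
  if (atomic_fetch_sub_explicit(&va->loops_remaining, 1, memory_order_acq_rel) == 1) {
    const float len = sqrtf(va->no[0] * va->no[0] +
                            va->no[1] * va->no[1] +
                            va->no[2] * va->no[2]);
    if (len != 0.0f) {
      va->no[0] /= len;
      va->no[1] /= len;
      va->no[2] /= len;
    }
  }
}
```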
#### Per-Thread Vertex Range (P2285)
This avoids locking: each thread loops over all polygons but only operates on vertices inside its own non-overlapping vertex index range (a rough sketch follows the table below).
Comparison with `master`:
| File | Before | After | Speedup |
| 00064k.blend | 0.002243 | 0.005680 | 0.394806 |
| 00128k.blend | 0.003724 | 0.012653 | 0.294345 |
| 00256k.blend | 0.007619 | 0.021723 | 0.350754 |
| 00512k.blend | 0.014994 | 0.033881 | 0.442554 |
| 01024k.blend | 0.031565 | 0.059081 | 0.534278 |
| 02048k.blend | 0.112171 | 0.108502 | 1.033816 |
| 25000k.blend | 0.521131 | 1.067226 | 0.488304 |
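For reference, a rough standalone C sketch of this scheme (hypothetical data layout, not the actual contents of P2285): each thread is given a disjoint slice of the vertex range and walks *all* polygons, skipping corners whose vertex falls outside its slice. No locks or atomics are needed, but every thread re-reads the whole mesh, which matches the poor scaling in the table above.

```
/* Sketch: lock-free accumulation where each thread owns a vertex range.
 * Assumes (for brevity) polygons of a uniform size. */
typedef struct RangeCtx {
  const int *corner_verts;        /* flattened: polys_num * poly_size entries */
  const float (*poly_normals)[3];
  const float *corner_weights;    /* one corner-angle weight per corner */
  float (*vert_normals)[3];
  int polys_num;
  int poly_size;
} RangeCtx;

static void accumulate_vertex_range(const RangeCtx *ctx,
                                    const int vert_start,
                                    const int vert_end /* exclusive */)
{
  for (int p = 0; p < ctx->polys_num; p++) {
    for (int c = 0; c < ctx->poly_size; c++) {
      const int i = p * ctx->poly_size + c;
      const int v = ctx->corner_verts[i];
      if (v < vert_start || v >= vert_end) {
        continue; /* this vertex belongs to another thread's range */
      }
      const float w = ctx->corner_weights[i];
      ctx->vert_normals[v][0] += ctx->poly_normals[p][0] * w;
      ctx->vert_normals[v][1] += ctx->poly_normals[p][1] * w;
      ctx->vert_normals[v][2] += ctx->poly_normals[p][2] * w;
    }
  }
}
```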
### Optimizations to the Current Method
#### Store Loop Weights Instead of Vectors (P2286)
This is only a minor change to the current behavior.
Instead of storing a weighted vector for each loop, store only its weight,
which is then combined with the polygon normal when accumulating into the vertex.
On average it's slightly faster and uses less memory (see the sketch after the table below).
| File | Before | After | Speedup |
| 00064k.blend | 0.002182 | 0.002049 | 1.065078 |
| 00128k.blend | 0.003416 | 0.003851 | 0.887053 |
| 00256k.blend | 0.007799 | 0.006961 | 1.120338 |
| 00512k.blend | 0.015225 | 0.014962 | 1.017578 |
| 01024k.blend | 0.030878 | 0.029043 | 1.063176 |
| 02048k.blend | 0.112791 | 0.119770 | 0.941730 |
| 25000k.blend | 0.525442 | 0.509423 | 1.031446 |
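To illustrate the difference in stored data (hypothetical names, not the paste itself): the prepare pass writes one `float` per loop instead of a weighted 3D vector, and the direction comes from the polygon normal at accumulation time.

```
/* Sketch: store one scalar weight per loop (corner) rather than a
 * pre-weighted 3D vector, reducing memory from 3 floats to 1 per loop.
 *
 *   before: float lnors_weighted[loops_num][3];  (weighted copies of poly normals)
 *   after:  float loop_weights[loops_num];       (corner-angle weights only)
 */
static void accumulate_weighted(float r_vert_normal[3],
                                const float poly_normal[3],
                                const float loop_weight)
{
  /* The polygon normal supplies the direction; the stored weight scales it. */
  r_vert_normal[0] += poly_normal[0] * loop_weight;
  r_vert_normal[1] += poly_normal[1] * loop_weight;
  r_vert_normal[2] += poly_normal[2] * loop_weight;
}
```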
#### Store Loop Weights Instead of Vectors (+ Cache Edge Vectors) (P2287)
A version of **"Store Loop Weights Instead of Vectors"** that also caches unit length edge vectors.
| File | Before | After | Speedup |
| 00064k.blend | 0.002170 | 0.002284 | 0.949978 |
| 00128k.blend | 0.003769 | 0.004815 | 0.782709 |
| 00256k.blend | 0.007604 | 0.008773 | 0.866766 |
| 00512k.blend | 0.015195 | 0.019465 | 0.780611 |
| 01024k.blend | 0.030726 | 0.040956 | 0.750213 |
| 02048k.blend | 0.111197 | 0.136106 | 0.816986 |
| 25000k.blend | 0.519313 | 0.639503 | 0.812057 |
#### Store Loop Weights Instead of Vectors (+ `fast_acos` & micro-optimizations) (P2290)
This patch uses minor optimizations to get the most out of the current method; this gives a ~13% overall speedup when averaging results, with some tests ~43% faster.
- Use `fast_acos` (less precise with `0.000067` error margin).
- Remove `alloca`, instead store previous/next edge-vectors, re-using the last to avoid an extra call to `normalize_v3`.
- Use one loop instead of two in `mesh_calc_normals_poly_prepare_cb` since the polygon normal isn't needed for calculating weights.
- Set `min_iter_per_thread` to 64 instead of 1024.
| File | Before | After | Speedup |
| regular_normals_00064k.blend | 0.001792 | 0.001621 | 1.105090 |
| regular_normals_00128k.blend | 0.003182 | 0.003111 | 1.022553 |
| regular_normals_00256k.blend | 0.006736 | 0.005774 | 1.166509 |
| regular_normals_00512k.blend | 0.014230 | 0.013210 | 1.077208 |
| regular_normals_01024k.blend | 0.038111 | 0.026630 | 1.431143 |
| regular_normals_02048k.blend | 0.106946 | 0.106411 | 1.005024 |
| regular_normals_25000k.blend | 0.517087 | 0.472683 | 1.093942 |
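To make the `alloca` removal from P2290's bullet list concrete, here is a hypothetical standalone sketch of walking a polygon while keeping only the previous and next normalized edge vectors, reusing the cached wrap-around edge so each edge is normalized exactly once; this illustrates the idea, it is not the paste itself.

```
/* Sketch: per-corner angle weights without an alloca'd edge-vector array.
 * Each edge is normalized once; the wrap-around edge is cached and reused. */
#include <math.h>
#include <string.h>

static void edge_dir(float r[3], const float a[3], const float b[3])
{
  float len_sq = 0.0f;
  for (int i = 0; i < 3; i++) {
    r[i] = b[i] - a[i];
    len_sq += r[i] * r[i];
  }
  const float len = sqrtf(len_sq);
  if (len != 0.0f) {
    for (int i = 0; i < 3; i++) {
      r[i] /= len;
    }
  }
}

static void poly_corner_weights(const float (*coords)[3],
                                const int *poly_verts,
                                const int verts_num,
                                float *r_weights)
{
  float edge_first[3], edge_prev[3], edge_next[3];

  /* The edge entering corner 0 (last vertex -> first vertex) is also the
   * edge leaving the final corner, so normalize it once and cache it. */
  edge_dir(edge_first, coords[poly_verts[verts_num - 1]], coords[poly_verts[0]]);
  memcpy(edge_prev, edge_first, sizeof(edge_prev));

  for (int i = 0; i < verts_num; i++) {
    if (i + 1 < verts_num) {
      edge_dir(edge_next, coords[poly_verts[i]], coords[poly_verts[i + 1]]);
    }
    else {
      memcpy(edge_next, edge_first, sizeof(edge_next)); /* reuse, no extra normalize */
    }
    /* Corner angle between the incoming and outgoing edge directions. */
    float d = -(edge_prev[0] * edge_next[0] +
                edge_prev[1] * edge_next[1] +
                edge_prev[2] * edge_next[2]);
    d = d < -1.0f ? -1.0f : (d > 1.0f ? 1.0f : d); /* clamp for acos */
    r_weights[i] = acosf(d);
    memcpy(edge_prev, edge_next, sizeof(edge_prev));
  }
}
```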
---
Thanks to @easythrees for helping investigate this patch.