Optimize mesh normal calculation.
- Remove the intermediate `lnors_weighted` array; accumulate directly into the normal array, using a lock for thread safety.
- Remove the single-threaded iteration over loops.
- Remove the stack array (`alloca`) used to pre-calculate next/previous directions.
- Perform the final normalization in-line, using an array of remaining loops per vertex to detect the last time each vertex is accessed.
The performance difference depends a lot on use case and threads available.
---
### Measurements for `BKE_mesh_calc_normals_poly`
{F10258734, size=full}
---
### Measurements for Time Between Updates (`1/FPS`)
This includes the time for a simple-deform modifier.
{F10258791, size=full}
---
### Measurements for Time Between Updates (`1/FPS`) 16x linked duplicates
To show how performance compares with multiple objects being calculated at once.
This includes the time for a simple-deform modifier which was used on each object.
{F10258674, size=full}
---
### Measurements for Time Between Updates (`1/FPS`) High Poly (1-10 million)
This includes the time for a simple-deform modifier.
{F10258696, size=full}
---
(P2291 was used to generate test data, P2283 to benchmark it.)
- Tested using 32 cores (64 threads), average of 10 calls.
- NOTE: above 1 million, steps of 100k were used instead of 10k.
---
Observations:
- In my tests the spin-lock almost never had to wait (roughly 0.01% of additions).
- The overhead of locking was negligible (replacing with `add_v3_v3` didn't make a significant difference to performance).
- Changing optimization flags didn't make a significant difference (`-O2`, `-O3`, `-Ofast`; both GCC and Clang gave comparable results).
- Manually limiting the threads to 4 shows much less variation in performance.
- `TaskParallelSettings.min_iter_per_thread` is set to `1024`, so tests on low-poly meshes will not saturate the CPU cores if `(cores * 1024 > poly_count)`, in my case this makes tests with meshes with < 64k polys give *noisy* results.
- While reducing `TaskParallelSettings.min_iter_per_thread` increases performance for low-poly meshes, it may cause problems in production files with many objects. I consider it out of scope for this patch since the gains are relatively minor in situations where performance typically isn't an issue; changing this value can be proposed independently of this patch.
- Comparing the "Before" times shows some noise in these tests; in practice the time comparisons were reproducible, with slowdowns and speedups fairly close on each execution.
- At around 650k polygons, this patch starts to perform better than master.
- With multiple objects both the speedup for high-poly meshes & penalty for lower poly meshes reduce significantly.
- Micro-optimizations in calculating normals make no significant difference, tested the following changes:
- Unrolling tri/quad face code-paths (removed `alloca` for edge-vectors).
- Replace `BLI_task` with `TBB`.
  - Use `madvise` to hint to the system about the kinds of memory access to expect.
---
Posting this for review since it effectively reverts rBd130c66db436b1fccbbde040839bc4cb5ddaacd2.
The only significant differences between this patch and the code before rBd130c66db436b1fccbbde040839bc4cb5ddaacd2 was applied are:
- Lock the entire vector before adding (instead of 3x `atomic_add_and_fetch_fl` calls per vertex).
- Vertex accumulation runs in parallel.
### Alternative Optimizations
NOTE: None of the alternatives are especially promising, keeping for reference.
Since this can be a significant bottleneck, it's worth investigating alternative optimizations.
So far most alternatives perform worse; however, there are some minor improvements to the current method that could be extracted into a separate patch.
#### In-Line Vertex Normalize (P2284)
Track the number of un-handled loops per vertex using atomic integer subtraction.
This method trades the separate multi-threaded vertex-normalization pass for a pass that counts the number of loops for each vertex.
The counter can be updated using atomic operations; once no loops remain, the vector is normalized as part of accumulation (instead of in a separate loop).
This reduces lock contention:
- The last addition before normalizing doesn't need to lock.
- The additional CPU cycles for inline normalization could make it less likely two threads access vectors at the same time.
With 25-million faces, this counting pass takes roughly 10% of the overall time.
| File | Before | After | Speedup |
| 00064k.blend | 0.001962 | 0.005820 | 0.337056 |
| 00128k.blend | 0.003666 | 0.011120 | 0.329671 |
| 00256k.blend | 0.007327 | 0.016328 | 0.448751 |
| 00512k.blend | 0.014868 | 0.024971 | 0.595429 |
| 01024k.blend | 0.029769 | 0.037430 | 0.795347 |
| 02048k.blend | 0.109871 | 0.051841 | 2.119380 |
| 25000k.blend | 0.517715 | 0.207094 | 2.499900 |
#### Per-Thread Vertex Range (P2285)
This avoids locking entirely: each thread loops over all polygons but only operates on vertices within its own non-overlapping vertex index range:
Comparison with `master`
| File | Before | After | Speedup |
| 00064k.blend | 0.002243 | 0.005680 | 0.394806 |
| 00128k.blend | 0.003724 | 0.012653 | 0.294345 |
| 00256k.blend | 0.007619 | 0.021723 | 0.350754 |
| 00512k.blend | 0.014994 | 0.033881 | 0.442554 |
| 01024k.blend | 0.031565 | 0.059081 | 0.534278 |
| 02048k.blend | 0.112171 | 0.108502 | 1.033816 |
| 25000k.blend | 0.521131 | 1.067226 | 0.488304 |
### Optimizations to the Current Method
#### Store Loop Weights Instead of Vectors (P2286)
This is only a minor change to the current behavior.
Instead of storing the weighted vector for each loop, store only its weight,
which is then combined with the polygon normal when accumulating into the vertex normal.
On average it's slightly faster and uses less memory.
| File | Before | After | Speedup |
| 00064k.blend | 0.002182 | 0.002049 | 1.065078 |
| 00128k.blend | 0.003416 | 0.003851 | 0.887053 |
| 00256k.blend | 0.007799 | 0.006961 | 1.120338 |
| 00512k.blend | 0.015225 | 0.014962 | 1.017578 |
| 01024k.blend | 0.030878 | 0.029043 | 1.063176 |
| 02048k.blend | 0.112791 | 0.119770 | 0.941730 |
| 25000k.blend | 0.525442 | 0.509423 | 1.031446 |
#### Store Loop Weights Instead of Vectors (+ Cache Edge Vectors) (P2287)
A version of **"Store Loop Weights Instead of Vectors"** that also caches unit length edge vectors.
| File | Before | After | Speedup |
| 00064k.blend | 0.002170 | 0.002284 | 0.949978 |
| 00128k.blend | 0.003769 | 0.004815 | 0.782709 |
| 00256k.blend | 0.007604 | 0.008773 | 0.866766 |
| 00512k.blend | 0.015195 | 0.019465 | 0.780611 |
| 01024k.blend | 0.030726 | 0.040956 | 0.750213 |
| 02048k.blend | 0.111197 | 0.136106 | 0.816986 |
| 25000k.blend | 0.519313 | 0.639503 | 0.812057 |
#### Store Loop Weights Instead of Vectors (+ `fast_acos` & micro-optimizations) (P2290)
This patch uses minor optimizations to get the most out of the current method; averaged over all results it gives a ~13% overall speedup, with some tests ~43% faster.
- Use `fast_acos` (less precise with `0.000067` error margin).
- Remove `alloca`; instead store the previous/next edge-vectors, re-using the last one to avoid an extra call to `normalize_v3`.
- Use one loop instead of two in `mesh_calc_normals_poly_prepare_cb` since the polygon normal isn't needed for calculating weights.
- Set `min_iter_per_thread` to 64 instead of 1024.
| File | Before | After | Speedup |
| regular_normals_00064k.blend | 0.001792 | 0.001621 | 1.105090 |
| regular_normals_00128k.blend | 0.003182 | 0.003111 | 1.022553 |
| regular_normals_00256k.blend | 0.006736 | 0.005774 | 1.166509 |
| regular_normals_00512k.blend | 0.014230 | 0.013210 | 1.077208 |
| regular_normals_01024k.blend | 0.038111 | 0.026630 | 1.431143 |
| regular_normals_02048k.blend | 0.106946 | 0.106411 | 1.005024 |
| regular_normals_25000k.blend | 0.517087 | 0.472683 | 1.093942 |
---
Thanks to @easythrees for helping investigate this patch.