Optimize mesh normal calculation.
- Remove the intermediate `lnors_weighted` array, accumulate directly into the normal array using a lock for thread safety.
- Remove single threaded iteration over loops.
- Final normalization is now done in-line, using an array of remaining loops to detect the last vertex accessed.
The performance difference depends a lot on use case and threads available.
Measurements for `BKE_mesh_calc_normals_poly`:
| File | Before | After | Speedup |
| regular_normals_00064k.blend | 0.001963 | 0.004893 | 0.401219 |
| regular_normals_00128k.blend | 0.003660 | 0.008474 | 0.431896 |
| regular_normals_00256k.blend | 0.007506 | 0.013773 | 0.544960 |
| regular_normals_00512k.blend | 0.014811 | 0.021998 | 0.673273 |
| regular_normals_01024k.blend | 0.029675 | 0.032579 | 0.910856 |
| regular_normals_02048k.blend | 0.109203 | 0.045100 | 2.421353 |
| regular_normals_25000k.blend | 0.513763 | 0.177814 | 2.889328 |
Tested using 32 cores (64 threads), average of 20 calls, see P2283.
----
Observations:
- In my tests the spin-lock was almost never waiting, roughly ~0.01% of additions.
- The overhead of locking was negligible (replacing with `add_v3_v3` didn't make a significant difference to performance).
- Changing optimization flags didn't make a significant difference (`-O2`, `-O3`, `-Ofast`, both GCC and CLANG gave comparable results).
Update:
- `TaskParallelSettings.min_iter_per_thread` is set to `1024`, so tests on low-poly meshes will not saturate the CPU cores if `(cores * 1024 > poly_count)`, in my case this makes tests with meshes with < 64k polys give *noisy* results.
- Micro-optimizations in calculating normals make no significant difference, tested the following changes:
- Unrolling tri/quad face code-paths (removed `alloca` for edge-vectors).
- Remove edge-vectors, storing previous/current normalized edge vectors (adding one additional normalization per face).
- Replace `BLI_task` with `TBB`.
----
Posting this for review as this reverts rBd130c66db436b1fccbbde040839bc4cb5ddaacd2,
The only significant things I can see that are different in this patch compared to the code before rBd130c66db436b1fccbbde040839bc4cb5ddaacd2 was applied are.
- Lock the entire vector before adding (instead of 3x `atomic_add_and_fetch_fl` calls per vertex).
- Vertex accumulation runs in parallel.
### Alternative Optimizations
#### In-Line Vertex Normalize (P2284)
Track the number of un-handled loops per vertex using atomic int subtraction.
This method trades a separate vertex normalization multi-threaded pass for a pass that counts the number or loops for each vertex
This can be updated using atomic operations, once there are no loops left the vector is normalized as part of accumulating (instead of a separate loop).
This reduces lock contention:
- The last addition before normalizing doesn't need to lock.
- The additional CPU cycles for inline normalization could make it less likely two threads access vectors at the same time.
With 25-million faces, counting this takes roughly 10% of the overall time.
| File | Before | After | Speedup |
| regular_normals_00064k.blend | 0.001962 | 0.005820 | 0.337056 |
| regular_normals_00128k.blend | 0.003666 | 0.011120 | 0.329671 |
| regular_normals_00256k.blend | 0.007327 | 0.016328 | 0.448751 |
| regular_normals_00512k.blend | 0.014868 | 0.024971 | 0.595429 |
| regular_normals_01024k.blend | 0.029769 | 0.037430 | 0.795347 |
| regular_normals_02048k.blend | 0.109871 | 0.051841 | 2.119380 |
| regular_normals_25000k.blend | 0.517715 | 0.207094 | 2.499900 |
#### Per-Thread Vertex Range (P2285)
This avoids locking by each thread looping over all polygons, only operating on vertices that are in their non-overlapping vertex index range:
Comparison with `master`
| File | Before | After | Speedup |
| regular_normals_00064k.blend | 0.002243 | 0.005680 | 0.394806 |
| regular_normals_00128k.blend | 0.003724 | 0.012653 | 0.294345 |
| regular_normals_00256k.blend | 0.007619 | 0.021723 | 0.350754 |
| regular_normals_00512k.blend | 0.014994 | 0.033881 | 0.442554 |
| regular_normals_01024k.blend | 0.031565 | 0.059081 | 0.534278 |
| regular_normals_02048k.blend | 0.112171 | 0.108502 | 1.033816 |
| regular_normals_25000k.blend | 0.521131 | 1.067226 | 0.488304 |
### Optimizations to the Current Method
#### Store Loop Weights Instead of Vectors (P2286)
This is only a minor change to the current behavior
Instead of storing the vectors for each loop, store their weight,
which is then used with the polygon normal when accumulating into the vertex.
On average it's slightly faster and uses less memory.
| File | Before | After | Speedup |
| regular_normals_00064k.blend | 0.002182 | 0.002049 | 1.065078 |
| regular_normals_00128k.blend | 0.003416 | 0.003851 | 0.887053 |
| regular_normals_00256k.blend | 0.007799 | 0.006961 | 1.120338 |
| regular_normals_00512k.blend | 0.015225 | 0.014962 | 1.017578 |
| regular_normals_01024k.blend | 0.030878 | 0.029043 | 1.063176 |
| regular_normals_02048k.blend | 0.112791 | 0.119770 | 0.941730 |
| regular_normals_25000k.blend | 0.525442 | 0.509423 | 1.031446 |
#### Store Loop Weights Instead of Vectors (+ Cache Edge Vectors) (P2287)
A version of **"Store Loop Weights Instead of Vectors"** that also caches unit length edge vectors.
| File | Before | After | Speedup |
| regular_normals_00064k.blend | 0.002170 | 0.002284 | 0.949978 |
| regular_normals_00128k.blend | 0.003769 | 0.004815 | 0.782709 |
| regular_normals_00256k.blend | 0.007604 | 0.008773 | 0.866766 |
| regular_normals_00512k.blend | 0.015195 | 0.019465 | 0.780611 |
| regular_normals_01024k.blend | 0.030726 | 0.040956 | 0.750213 |
| regular_normals_02048k.blend | 0.111197 | 0.136106 | 0.816986 |
| regular_normals_25000k.blend | 0.519313 | 0.639503 | 0.812057 |
----
Thanks to @easythrees for helping investigate this patch.