Optimize mesh normal calculation.
- Remove the intermediate `lnors_weighted` array; accumulate directly into the normal array, using a lock for thread safety.
- Remove the single-threaded iteration over loops.
- Final normalization is now done inline, using a per-vertex count of remaining loops to detect when the last loop referencing a vertex has been accumulated.
The performance difference depends a lot on use case and threads available.
Measurements for `BKE_mesh_calc_normals_poly`:
| File | Before | After | Speedup |
| --- | --- | --- | --- |
| regular_normals_00064k.blend | 0.001962 | 0.005820 | 0.337056 |
| regular_normals_00128k.blend | 0.003666 | 0.011120 | 0.329671 |
| regular_normals_00256k.blend | 0.007327 | 0.016328 | 0.448751 |
| regular_normals_00512k.blend | 0.014868 | 0.024971 | 0.595429 |
| regular_normals_01024k.blend | 0.029769 | 0.037430 | 0.795347 |
| regular_normals_02048k.blend | 0.109871 | 0.051841 | 2.119380 |
| regular_normals_25000k.blend | 0.517715 | 0.207094 | 2.499900 |
Tested using 32 cores (64 threads), average of 20 calls, see P2283.
----
Observations:
- In my tests the spin-lock almost never had to wait: roughly 0.01% of additions contended.
- The overhead of locking was negligible (replacing with `add_v3_v3` didn't make a significant difference to performance).
- Changing optimization flags didn't make a significant difference (`-O2`, `-O3`, `-Ofast`, both GCC and CLANG gave comparable results).
Update:
- `TaskParallelSettings.min_iter_per_thread` is set to `1024`, so tests on low-poly meshes will not saturate the CPU cores when `(cores * 1024 > poly_count)`. In my case this means meshes with fewer than 64k polys give *noisy* results.
- Micro-optimizations in calculating normals make no significant difference.
Tested minor changes, none made any significant difference:
- Unrolling tri/quad face code-paths (removing `alloca` for edge-vectors).
- Removing edge-vectors, storing only the previous/current normalized edge vectors (adding one extra normalization per face).
- Replacing `BLI_task` with `TBB`.
----
Posting this for review since it reverts rBd130c66db436b1fccbbde040839bc4cb5ddaacd2.
The only significant differences I can see between this patch and the code before rBd130c66db436b1fccbbde040839bc4cb5ddaacd2 was applied are:
- Lock the entire vector before adding (instead of 3x `atomic_add_and_fetch_fl` calls per vertex).
- Vertex accumulation runs in parallel.
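For contrast, the earlier per-component approach can be approximated in portable C11 (Blender uses `atomic_add_and_fetch_fl`; the compare-exchange loop below is an illustrative stand-in, since C11 has no atomic float add):

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Illustrative stand-in for atomic_add_and_fetch_fl: emulate an atomic
 * float add with a compare-exchange loop on the float's bit pattern. */
static float atomic_add_fl(_Atomic uint32_t *p, float add)
{
  uint32_t old_bits = atomic_load(p);
  uint32_t new_bits;
  float f;
  do {
    memcpy(&f, &old_bits, sizeof(f));
    f += add;
    memcpy(&new_bits, &f, sizeof(f));
    /* On failure, old_bits is refreshed with the current value and the
     * addition is retried against it. */
  } while (!atomic_compare_exchange_weak(p, &old_bits, new_bits));
  return f;
}

/* The pre-revert scheme: three independent atomic operations per vertex.
 * Each component stays consistent on its own, but the vector as a whole is
 * never updated atomically; this patch instead takes one lock around the
 * full 3-component add. */
static void add_v3_v3_atomic(_Atomic uint32_t v[3], const float a[3])
{
  atomic_add_fl(&v[0], a[0]);
  atomic_add_fl(&v[1], a[1]);
  atomic_add_fl(&v[2], a[2]);
}
```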
----
Thanks to @easythrees for helping investigate this patch.