Optimize mesh normal calculation.
- Remove the intermediate `lnors_weighted` array; accumulate directly into the normal array, using a lock for thread safety.
- Remove the single-threaded iteration over loops.
- Final normalization is now done in-line, using an array of remaining loop counts to detect when each vertex is accessed for the last time (see the sketch after this list).
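A minimal sketch of this accumulation scheme, assuming Blender's `BLI_threads`/`BLI_math` helpers; names such as `AccumContext` and `verts_remaining_loops` are illustrative, not the patch's actual identifiers:

```c
#include "BLI_math.h"
#include "BLI_threads.h"

typedef struct AccumContext {
  float (*vnos)[3];           /* Per-vertex normals, accumulated in place. */
  int *verts_remaining_loops; /* Per-vertex count of face-corners still to be added. */
  SpinLock lock;              /* Guards the whole 3-float add. */
} AccumContext;

static void accumulate_corner(AccumContext *ctx, const int v, const float fno[3], const float weight)
{
  BLI_spin_lock(&ctx->lock);
  madd_v3_v3fl(ctx->vnos[v], fno, weight); /* vno += fno * weight */
  const int remaining = --ctx->verts_remaining_loops[v];
  BLI_spin_unlock(&ctx->lock);

  /* The thread that adds the last remaining corner normalizes in-line:
   * no other thread touches this vertex again, so no second pass is needed. */
  if (remaining == 0) {
    normalize_v3(ctx->vnos[v]);
  }
}
```

The per-corner calls would be issued from the parallel loop over polygons (e.g. `BLI_task_parallel_range`), one call per face-corner.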
The performance difference depends a lot on use case and threads available.
Measurements for `BKE_mesh_calc_normals_poly`:
| File | Before | After | Speedup |
| --- | --- | --- | --- |
| regular_normals_00064k.blend | 0.001963 | 0.004893 | 0.401219 |
| regular_normals_00128k.blend | 0.003660 | 0.008474 | 0.431896 |
| regular_normals_00256k.blend | 0.007506 | 0.013773 | 0.544960 |
| regular_normals_00512k.blend | 0.014811 | 0.021998 | 0.673273 |
| regular_normals_01024k.blend | 0.029675 | 0.032579 | 0.910856 |
| regular_normals_02048k.blend | 0.109203 | 0.045100 | 2.421353 |
| regular_normals_25000k.blend | 0.513763 | 0.177814 | 2.889328 |
Tested using 32 cores (64 threads), average of 20 calls, see P2283.
----
Observations:
- In my tests the spin-lock almost never waited: it was contended on roughly 0.01% of additions (see the measurement sketch after this list).
- The locking overhead was negligible (replacing the locked accumulation with a plain, unguarded `add_v3_v3` made no significant difference to performance).
- Changing optimization flags made no significant difference (`-O2`, `-O3` and `-Ofast` gave comparable results with both GCC and Clang).
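For reference, a self-contained sketch (plain C11, not the patch's code) of how that contention ratio can be measured: an `atomic_flag` stands in for the spin-lock, and a counter is bumped whenever the first test-and-set finds the lock already held:

```c
#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;
static atomic_ulong total = 0;
static atomic_ulong contended = 0;

static void locked_add_v3_v3_counted(float r[3], const float a[3])
{
  atomic_fetch_add(&total, 1);
  /* If the first test-and-set finds the flag already set, another thread holds
   * the lock: count the contention, then spin until it is released. */
  if (atomic_flag_test_and_set(&lock)) {
    atomic_fetch_add(&contended, 1);
    while (atomic_flag_test_and_set(&lock)) {
      /* busy-wait */
    }
  }
  r[0] += a[0];
  r[1] += a[1];
  r[2] += a[2];
  atomic_flag_clear(&lock);
}

/* Contention ratio = contended / total; roughly 0.01% in the tests above. */
```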
Update:
- `TaskParallelSettings.min_iter_per_thread` is set to `1024`, so tests on low-poly meshes will not saturate the CPU cores when `(cores * 1024 > poly_count)`; in my case this makes tests on meshes with fewer than ~64k polys give *noisy* results (see the back-of-the-envelope sketch after this list).
- Micro-optimizations in calculating normals make no significant difference.
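A back-of-the-envelope sketch of that saturation point (an illustrative helper, not Blender API; it assumes the scheduler hands out at least `min_iter_per_thread` polys per thread and uses the 64 hardware threads of the test machine above):

```c
/* Estimate how many threads a parallel range over `poly_count` items can keep busy. */
static int estimate_threads_used(const int poly_count, const int hw_threads, const int min_iter_per_thread)
{
  int by_work = poly_count / min_iter_per_thread;
  if (by_work < 1) {
    by_work = 1;
  }
  return (by_work < hw_threads) ? by_work : hw_threads;
}

/* estimate_threads_used(32 * 1024, 64, 1024)   == 32  (half the threads idle)
 * estimate_threads_used(64 * 1024, 64, 1024)   == 64  (~64k polys: just saturated)
 * estimate_threads_used(2048 * 1024, 64, 1024) == 64  (fully saturated) */
```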
Tested minor changes, none made any significant difference:
- Unrolling the tri/quad face code-paths, removing the `alloca` for edge-vectors (a sketch of the triangle path follows this list).
- Removing the edge-vector buffer, storing only the previous/current normalized edge vectors (at the cost of one additional normalization per face).
- Replace `BLI_task` with `TBB`.
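For context, roughly what the unrolled triangle path looks like, as a sketch assuming Blender's `BLI_math` helpers; the spin-lock around the adds is omitted for brevity and the names are illustrative:

```c
#include "BLI_math.h"

/* Angle-weighted accumulation for a single triangle, with the corner angles
 * computed directly instead of via an alloca'd edge-vector buffer. */
static void accumulate_tri(float r_vno0[3], float r_vno1[3], float r_vno2[3],
                           const float co0[3], const float co1[3], const float co2[3])
{
  float fno[3], e01[3], e12[3], e20[3];
  normal_tri_v3(fno, co0, co1, co2);

  sub_v3_v3v3(e01, co1, co0);
  sub_v3_v3v3(e12, co2, co1);
  sub_v3_v3v3(e20, co0, co2);
  normalize_v3(e01);
  normalize_v3(e12);
  normalize_v3(e20);

  /* Interior corner angle = acos(-dot(incoming_edge, outgoing_edge)). */
  madd_v3_v3fl(r_vno0, fno, saacos(-dot_v3v3(e20, e01)));
  madd_v3_v3fl(r_vno1, fno, saacos(-dot_v3v3(e01, e12)));
  madd_v3_v3fl(r_vno2, fno, saacos(-dot_v3v3(e12, e20)));
}
```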
----
Posting this for review as it reverts rBd130c66db436b1fccbbde040839bc4cb5ddaacd2.
The only significant differences in this patch compared to the code before rBd130c66db436b1fccbbde040839bc4cb5ddaacd2 was applied are:
- Lock the entire vector before adding, instead of 3x `atomic_add_and_fetch_fl` calls per vertex (see the sketch after this list).
- Vertex accumulation runs in parallel.
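Sketches of the two accumulation variants being contrasted here (illustrative, not verbatim source); `atomic_add_and_fetch_fl` is the atomic float add from `atomic_ops.h` and `SpinLock` comes from `BLI_threads.h`:

```c
#include "BLI_threads.h"
#include "atomic_ops.h"

/* Pre-rBd130c66d style: three independent atomic float adds per corner. */
static void accumulate_atomic(float vno[3], const float fno[3], const float weight)
{
  atomic_add_and_fetch_fl(&vno[0], fno[0] * weight);
  atomic_add_and_fetch_fl(&vno[1], fno[1] * weight);
  atomic_add_and_fetch_fl(&vno[2], fno[2] * weight);
}

/* This patch: a single spin-lock held across the whole 3-float add. */
static void accumulate_locked(SpinLock *lock, float vno[3], const float fno[3], const float weight)
{
  BLI_spin_lock(lock);
  vno[0] += fno[0] * weight;
  vno[1] += fno[1] * weight;
  vno[2] += fno[2] * weight;
  BLI_spin_unlock(lock);
}
```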
----
Thanks to @easythrees for helping investigate this patch.