Optimize mesh normal calculation.
- Remove the intermediate `lnors_weighted` array; accumulate directly into the normal array, using a lock for thread safety.
- Remove the single-threaded iteration over loops.
- Final normalization is now done in-line, using a per-vertex count of remaining loops to detect the last access to each vertex.
The performance difference depends a lot on the use case and the number of threads available.
Rough measurements for `BKE_mesh_calc_normals_poly`:
- 2.92x faster with ~25 million quads.
- 2.7x faster with ~2 million quads.
- 1.3x slower with 512k quads.
- 1.6x slower with 256k quads.
- 2.26x slower with 128k quads.
- 1.5x slower with 16k quads.
- 1.15x slower with 8k quads.
Tested using 32 cores (64 threads).
----
Observations:
- In my tests the spin-lock was almost never contended: it waited on roughly 0.01% of additions.
- The overhead of locking was negligible (replacing the locked accumulation with a plain `add_v3_v3` made no significant difference to performance).
- Changing optimization flags made no significant difference (`-O2`, `-O3` and `-Ofast` gave comparable results on both GCC and Clang).
----
Posting this for review since it effectively reverts rBd130c66db436b1fccbbde040839bc4cb5ddaacd2.
The only significant differences I can see between this patch and the code before that commit was applied are:
- The entire vector is locked while adding (instead of three `atomic_add_and_fetch_fl` calls per vertex).
- Vertex accumulation runs in parallel.
----
Thanks to @easythrees for helping investigate this patch.