Test results on a Ryzen 3700X, 4096 x 4096 grid:

| Change | Time | Speedup |
| --- | --- | --- |
| Baseline | 659 ms | 1.0x |
| Parallelize Polys | 510 ms | 1.3x baseline |
| Parallelize Edges | 444 ms | 1.5x baseline |
| Parallelize Vertices | 407 ms | 1.6x baseline |
| Parallelize UVs | 290 ms | 2.3x baseline |
| Grain size 1024 -> 512 | 265 ms | 2.5x baseline |

For smaller grids, all this should do is increase the code size a bit
and add a few more if statements, since ranges that fit within a single
grain are not split across threads.
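
To illustrate what the grain size controls and why small grids only pay for an extra branch, here is a minimal sketch of a chunked parallel loop. It is an assumption-laden stand-in built on `std::thread`, not the task scheduler the patch actually uses; `parallel_for_sketch` and `fill_column_indices` are hypothetical names made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

/* Hypothetical stand-in for a parallel loop with a grain size.
 * Calls fn(begin, end) on chunks of [0, size) holding at most grain_size items. */
template<typename Fn>
void parallel_for_sketch(const int64_t size, const int64_t grain_size, const Fn &fn)
{
  assert(grain_size > 0);
  if (size <= 0) {
    return;
  }
  /* Small ranges run serially on the calling thread: the only cost is this branch. */
  if (size <= grain_size) {
    fn(int64_t(0), size);
    return;
  }
  const int64_t chunks_num = (size + grain_size - 1) / grain_size;
  const int64_t hardware_threads = std::max(1u, std::thread::hardware_concurrency());
  const int64_t threads_num = std::min(chunks_num, hardware_threads);

  std::vector<std::thread> threads;
  threads.reserve(std::size_t(threads_num));
  for (int64_t t = 0; t < threads_num; t++) {
    /* Thread t handles chunks t, t + threads_num, t + 2 * threads_num, ... */
    threads.emplace_back([&, t]() {
      for (int64_t chunk = t; chunk < chunks_num; chunk += threads_num) {
        const int64_t begin = chunk * grain_size;
        const int64_t end = std::min(begin + grain_size, size);
        fn(begin, end);
      }
    });
  }
  for (std::thread &thread : threads) {
    thread.join();
  }
}

/* Example workload: store the column index of every vertex in a row-major grid.
 * Each chunk writes a disjoint slice of the result, so no locking is needed. */
std::vector<int64_t> fill_column_indices(const int64_t verts_x,
                                         const int64_t verts_y,
                                         const int64_t grain_size)
{
  std::vector<int64_t> result(std::size_t(verts_x * verts_y));
  parallel_for_sketch(verts_x * verts_y, grain_size, [&](const int64_t begin, const int64_t end) {
    for (int64_t i = begin; i < end; i++) {
      result[std::size_t(i)] = i % verts_x;
    }
  });
  return result;
}
```

A smaller grain size produces more chunks, which can balance the work better at the cost of more scheduling overhead. Calling `fill_column_indices(4, 3, 512)` takes the serial branch, while `fill_column_indices(300, 200, 512)` is split into chunks of 512 elements.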
The next thing I'd test is merging some of the parallel loops
to avoid recalculating some of the indices and to reduce
threading overhead. I don't know whether that would improve performance,
and it would be a bit more complex, so I'd rather look into that
as a separate step.
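
As a rough sketch of what merging loops could look like: the two separate range functions below each recompute `x` and `y` from the flat index, while the merged one computes them once per element. Everything here is hypothetical (`GridSketch`, the per-vertex UV arrays, the function names) and simplified for illustration; it is not the data layout of the patch, and the loop bodies are run serially for brevity. Each function takes a `[begin, end)` range, which is the kind of body that would be handed to a parallel loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

/* Hypothetical, simplified grid storage with one value per vertex. */
struct GridSketch {
  int64_t verts_x = 0;
  int64_t verts_y = 0;
  std::vector<float> pos_x, pos_y, uv_u, uv_v;
};

GridSketch make_grid(const int64_t verts_x, const int64_t verts_y)
{
  /* At least two vertices per axis, so that the UV division below is valid. */
  assert(verts_x >= 2 && verts_y >= 2);
  GridSketch grid;
  grid.verts_x = verts_x;
  grid.verts_y = verts_y;
  const std::size_t total = std::size_t(verts_x * verts_y);
  grid.pos_x.resize(total);
  grid.pos_y.resize(total);
  grid.uv_u.resize(total);
  grid.uv_v.resize(total);
  return grid;
}

/* Separate pass 1: positions. Recomputes x and y from the flat index. */
void fill_positions(GridSketch &grid, int64_t begin, int64_t end, float dx, float dy)
{
  for (int64_t i = begin; i < end; i++) {
    const int64_t x = i % grid.verts_x;
    const int64_t y = i / grid.verts_x;
    grid.pos_x[std::size_t(i)] = float(x) * dx;
    grid.pos_y[std::size_t(i)] = float(y) * dy;
  }
}

/* Separate pass 2: UVs. Recomputes the same x and y again. */
void fill_uvs(GridSketch &grid, int64_t begin, int64_t end)
{
  for (int64_t i = begin; i < end; i++) {
    const int64_t x = i % grid.verts_x;
    const int64_t y = i / grid.verts_x;
    grid.uv_u[std::size_t(i)] = float(x) / float(grid.verts_x - 1);
    grid.uv_v[std::size_t(i)] = float(y) / float(grid.verts_y - 1);
  }
}

/* Merged pass: x and y are computed once and used for both outputs. */
void fill_positions_and_uvs(GridSketch &grid, int64_t begin, int64_t end, float dx, float dy)
{
  for (int64_t i = begin; i < end; i++) {
    const int64_t x = i % grid.verts_x;
    const int64_t y = i / grid.verts_x;
    grid.pos_x[std::size_t(i)] = float(x) * dx;
    grid.pos_y[std::size_t(i)] = float(y) * dy;
    grid.uv_u[std::size_t(i)] = float(x) / float(grid.verts_x - 1);
    grid.uv_v[std::size_t(i)] = float(y) / float(grid.verts_y - 1);
  }
}
```

The merged version also means one parallel dispatch instead of two, but it writes to more arrays per iteration, so whether it is actually faster would need to be measured.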

The final share of time:

| Stage | Time | Share |
| --- | --- | --- |
| UVs | 81 ms | 31% |
| Polys & Loops | 71 ms | 27% |
| Edges | 45 ms | 17% |
| Vertices | 35 ms | 13% |
| Filling Normals | 32 ms | 12% |