This patch is a big overhaul to the Mikktspace module, which is used to compute tangents.
I'm not calling it a rewrite since it's the result of a lot of iterations on the original code, but pretty much everything is reworked somehow.
The overall goal was to a) make it faster and b) make it maintainable.
Notable changes:
- Since the callbacks for requesting geometry data were a big bottleneck before, I've ported it to C++ and made it header-only, templating on the data source. That way, the compiler generates code specific to the caller, which allows it to inline the data source and specialize for some cases (e.g. subd vs. non-subd in Cycles).
- The one input parameter, an optional angle threshold, was not used anywhere. It turns out that dropping it allows for considerable algorithmic simplification and gets rid of a lot of the complexity in the later stages, so the new code no longer has the option.
- The code computes several outputs, but only one (the tangent itself) is ever used. I've kept the code to compute the others, but put them behind a preprocessor define so that they don't have any performance impact for now but could be brought back in the future if ever needed.
- Even with the inlined data source, it turns out that keeping a local copy of the mesh data still provides a considerable speedup (roughly 30%, iirc), so the code copies the data locally for now. This can be turned off using another preprocessor define, but since I've removed some memory requirements elsewhere, the extra copy is probably fine to keep.
- The original code had fallback paths for many steps in case temporary memory allocation fails, but those never actually get used since malloc() doesn't really ever return NULL in practice, so I removed them.
- In general, I've restructured A LOT of the code to make the algorithms clearer and make use of some C++ features (vectors, std::array, booleans, classes), though there's still a lot of cleanup that could be done.
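
To illustrate the templating and local-copy points from the list above, here's a minimal sketch. It is not the code from the patch: `IndexedTriMesh`, `CornerCache` and all member names are made up for illustration, and it assumes a triangles-only mesh. The point is just that the data source is a template parameter, so calls like `mesh.position()` are resolved at compile time and can be inlined, and that the data gets copied into a flat local array once up front.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using float3 = std::array<float, 3>;

// Hypothetical caller-side data source (stand-in for whatever Blender or
// Cycles would pass in); only the two member functions below are required.
struct IndexedTriMesh {
  std::vector<float3> positions;
  std::vector<std::array<int, 3>> triangles;

  int num_faces() const
  {
    return static_cast<int>(triangles.size());
  }

  float3 position(int face, int corner) const
  {
    assert(corner >= 0 && corner < 3);
    return positions[static_cast<std::size_t>(triangles[face][corner])];
  }
};

// Templated on the data source instead of going through C function pointers.
// Each caller gets its own instantiation, so the compiler sees the body of
// Mesh::position() and can inline it.
template<typename Mesh> class CornerCache {
 public:
  // Copy every face corner into a flat local array once, so later stages
  // read contiguous memory instead of querying the data source again.
  explicit CornerCache(const Mesh &mesh)
  {
    const int num_faces = mesh.num_faces();
    corners_.reserve(static_cast<std::size_t>(num_faces) * 3);
    for (int f = 0; f < num_faces; f++) {
      for (int c = 0; c < 3; c++) {
        corners_.push_back(mesh.position(f, c));
      }
    }
  }

  const float3 &corner(int face, int corner) const
  {
    return corners_[static_cast<std::size_t>(face) * 3 + static_cast<std::size_t>(corner)];
  }

  std::size_t size() const
  {
    return corners_.size();
  }

 private:
  std::vector<float3> corners_;
};
```

A caller would instantiate this as `CornerCache<IndexedTriMesh> cache(mesh);`. A second data source type (e.g. a subd variant) would get its own specialized instantiation without any runtime branching.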
As for results: For a test scene, tangent build time went from 0.91 sec to 0.30 sec for me. One case where this really helps is Eevee viewport performance, since anything involving a normal map will compute tangents. For another test in Eevee, viewport FPS went from 2.3 to 3.3 (the viewport engine without tangents gets 5.9). Finally, the test case from T97378, which used to take 6.64 sec and went down to 4.92 sec with D14675, now takes 2.24 sec.
One major thing that could still be done is parallelization, but I didn't do it yet because a) I'm not sure how to do it cleanly, since this is used in both Cycles and Blender and I don't want to just throw some OpenMP in there (a RunParallel callback could probably work), and b) Blender already parallelizes across meshes. I guess parallelizing across meshes in Cycles would be a good next step and should be easy.
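
For reference, this is roughly what I mean by a RunParallel callback. It's only a sketch of the idea and nothing like it exists in the patch: `RunParallelFn`, `run_serial`, `run_std_threads` and `scale_all` are all hypothetical names, and `scale_all` is just a stand-in for some per-item stage. The caller supplies the executor, so the shared code never has to know whether it's running on Blender's or Cycles' task system. The `std::thread` version is only there to make the example self-contained.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical callback type: run task(i) for every i in [0, num_items).
using RunParallelFn =
    std::function<void(int num_items, const std::function<void(int)> &task)>;

// Serial fallback, e.g. for callers that already parallelize across meshes.
inline void run_serial(int num_items, const std::function<void(int)> &task)
{
  for (int i = 0; i < num_items; i++) {
    task(i);
  }
}

// Plain std::thread executor, used here only to keep the sketch standalone.
// A real caller would forward to its own task scheduler instead.
inline void run_std_threads(int num_items, const std::function<void(int)> &task)
{
  assert(num_items >= 0);
  const int hw = static_cast<int>(std::thread::hardware_concurrency());
  const int num_threads = std::max(1, std::min(num_items, hw > 0 ? hw : 1));
  std::vector<std::thread> threads;
  for (int t = 0; t < num_threads; t++) {
    // Each thread handles items t, t + num_threads, t + 2 * num_threads, ...
    threads.emplace_back([=, &task]() {
      for (int i = t; i < num_items; i += num_threads) {
        task(i);
      }
    });
  }
  for (std::thread &thread : threads) {
    thread.join();
  }
}

// Stand-in for a per-item stage of the shared code. Every task writes only
// its own output slot, so no locking is needed.
inline std::vector<float> scale_all(const std::vector<float> &in,
                                    const RunParallelFn &run_parallel)
{
  std::vector<float> out(in.size());
  run_parallel(static_cast<int>(in.size()), [&](int i) {
    out[static_cast<std::size_t>(i)] = 2.0f * in[static_cast<std::size_t>(i)];
  });
  return out;
}
```

Calling `scale_all(data, run_serial)` and `scale_all(data, run_std_threads)` should give identical results. Only stages whose items are independent could be handed over like this.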
Also, considering how many corner cases there are in this algorithm, some testing certainly wouldn't hurt. All the existing tests run fine at least, and I didn't see any differences in my tests.
Not sure about reviewers, so I just checked the file history. Please add/remove as appropriate.