Page MenuHome

Mikktspace: Optimized port to C++
ClosedPublic

Authored by Lukas Stockner (lukasstockner97) on Aug 1 2022, 3:48 AM.
Tags
None
Tokens
"Love" token, awarded by Andrea_Monzini."Like" token, awarded by CreatorSiSo."Love" token, awarded by Raimund58."Manufacturing Defect?" token, awarded by duarteframos."Love" token, awarded by Archivist15."Love" token, awarded by erickblender."Love" token, awarded by Dangry."Love" token, awarded by Roggii.

Details

Summary

This patch is a big overhaul to the Mikktspace module, which is used to compute tangents.
I'm not calling it a rewrite since it's the result of a lot of iterations on the original code, but pretty much everything is reworked somehow.

Overall goal was to a) make it faster and b) make it maintainable.

Notable changes:

  • Since the callbacks for requesting geometry data were a big bottleneck before, I've ported it to C++ and made it header-only, templating on the data source. That way, the compiler generates code specific to the caller, which allows it to inline the data source and specialize for some cases (e.g. subd vs. non-subd in Cycles).
  • The one input parameter, an optional angle threshold, was not used anywhere. Turns out that removing it allows for considerable algorithmic simplification, removing a lot of the complexity in the later stages. Therefore, I've just removed the option in the new code.
  • The code computes several outputs, but only one (the tangent itself) is ever used. I've kept the code to compute the others, but put them behind a preprocessor define so that they don't have any performance impact for now but could be brought back in the future if ever needed.
  • Even with the inlined data source, it turns out that keeping a local copy of the mesh data still provides considerable speedup (~30%ish iirc), so the code copies the data locally for now. This can be turned off using another preprocessor define, but since I've removed some memory requirements elsewhere, this is probably fine to keep.
  • The original code had fallback paths for many steps in case temporary memory allocation fails, but that never actually gets used anyways since malloc() doesn't really ever return NULL in practise, so I removed them.
  • In general, I've restructured A LOT of the code to make the algorithms clearer and make use of some C++ features (vectors, std::array, booleans, classes), though there's still a lot of cleanup that could be done.

As for results: For a test scene, tangent build time went from 0.91sec to 0.30sec for me. One case where this really helps is Eevee viewport performance, since anything involving a normal map will compute tangents. For another test in Eevee, viewport FPS went from 2.3 to 3.3 (viewport engine without tangents is 5.9). Finally, the test case from T97378 that used to be 6.64sec and went down to 4.92sec with D14675 is now 2.24sec.

One major thing that could still be done is parallelization, but a) I'm not sure how to do that cleanly since this is used in both Cycles and Blender, so I don't want to just throw some OpenMP in there (probably a RunParallel callback could work) and b) Blender already parallelizes across meshes so I didn't do it yet. I guess parallelizing across meshes in Cycles would be a good next step and should be easy.

Also, considering how many corner cases there are in this algorithm, some testing certainly wouldn't hurt. All the existing tests run fine at least, and I didn't see any differences in my tests.

Not sure about reviewers, so I just checked the file history. Please add/remove as appropriate.

Diff Detail

Repository
rB Blender
Branch
mikktspace (branched from master)
Build Status
Buildable 23218
Build 23218: arc lint + arc unit

Event Timeline

Lukas Stockner (lukasstockner97) requested review of this revision.Aug 1 2022, 3:48 AM
Lukas Stockner (lukasstockner97) created this revision.

Also, I wasn't sure how to create a header-only library in the build system, so I added a dummy compilation unit. There's probably a better way to do it?

Nice speedup! Just reading out of curiosity and noticed a couple things.

source/blender/blenkernel/intern/mesh_tangent.cc
39–81

The style guide mentions that member variables should be declared before functions.

Also, I'm not sure if there's a reason you used uint, but the style guide mentions to use signed integers and just use assertions to check for non-negative values if necessary.

94

BTW, the timer is usually created with a macro SCOPED_TIMER(__func__); which takes care of the naming for you

Sergey Sharybin (sergey) requested changes to this revision.EditedAug 1 2022, 9:58 AM

Nice to see this is being tackled!

Adding an empty translation unit will trigger compilation warning about translation unit not defining symbols. Not adding CMake target (as suggested in the linked commit) has another drawback: it makes it impossible to navigate into the source from an IDE.

The proper modern CMake way of adding header-only library is to add_library(my_library INTERFACE foo.h), but that requires CMake 3.19. What I'm wondering is whether with our CMake version you can add_library(my_library INTERFACE) followed with target_sources(my_library foo.h).

If the above can not be made to work quickly, think it would be fine to simple add some dummy function definition in the dummy.c and re-iterate over it once we can use newer CMake.

Did a very quick pass of review without even having time to compile it. Some simple-to-address points are inlined.
Overall looks good, but someone still want to look more carefully :)

Once thing I'd suggest is to tackle C-to-C++ transition of files from blenkernel as a separate commit. But if it is hard to do at this point then I'm not so much fussed about it.

I've started the build to see if all platforms are happy: https://builder.blender.org/admin/#/builders/18/builds/581

EDIT: The build has failed, so this is to be addressed as well.

intern/cycles/blender/mesh.cpp
64

Here and in other places suggest to if constexpr (is_subd)

112

It might worth specializing the implementation further, avoiding per-vertex branching.
Maybe add a TODO, as it seems something interesting to investigate but is not required for the initial implementation.

intern/mikktspace/float3.hpp
9–10

Either use #pragma once as in the rest of Blender (as per style guide even AFAIR), or use less generic guard like MIKKSPACE_FLOAT3_H_.

intern/mikktspace/mikktspace.hpp
14

If there is really good reason to do it globally please add comment about it. Otherwise move inside of namespace mikk.

16

How much more memory we are talking about? Is it going to be an entire copy of mesh?

19–20

In all this time we never planned to use those. I think for the simplicity we can remove related code.

This revision now requires changes to proceed.Aug 1 2022, 9:58 AM

Regarding CMake, D12122: CMake : Move to more "modern cmake" use. will address this. See how libatomic is handled there, it only makes the source files visible in the IDE with CMake 3.19+.

One major thing that could still be done is parallelization, but a) I'm not sure how to do that cleanly since this is used in both Cycles and Blender, so I don't want to just throw some OpenMP in there (probably a RunParallel callback could work) and b) Blender already parallelizes across meshes so I didn't do it yet. I guess parallelizing across meshes in Cycles would be a good next step and should be easy.

Both Cycles and Blender use TBB, so that is ok to use here. Ideally done in such a way that it works with and without TBB (see BLI_task.hh for example of how to fallback to serial functions).

intern/cycles/scene/mesh.cpp
538

Remove debug print.

  • Addressed code review
  • Updated CMake to follow D12122 atomic approach
  • Removed leftover debugging code (timing stuff etc.)
  • Split out C++ conversion to D15636 (still included here as separate commit for now)
  • Fixed build issues (I think)

Regarding uint (@Hans Goudey (HooglyBoogly)): I don't have a strong opinion there, iirc there were some sign conversion errors in Mikktspace to I just changed the entire file to use uint, and consequently also made the Mesh interface use uint. I can change that, but from what I can see in the style guide, there's no preference for signed - it just says to not use unsigned as an implicit assert, which the code doesn't to afaik.

Regarding memory (@Sergey Sharybin (sergey)): I'll collect some actual memory figures, but yeah, it does keep the entire mesh (to be precise: position, normal and UV for each loop) in memory. Not great.
The problem is the "find which loops are identical" part. For positions, we could take a shortcut by comparing vertex index, but it's not that straightforward for normals and UVs. We could use 64-bit hashes to reduce collision probability, but we still need the proper comparison for all entries that actually are equal (unless we consider the collision probability low enough to skip the check, which is not great).
I'll look further into whether we really need the copy for performance.

Regarding signed vs. unsigned integers: my reading of the style guide is that there is a preference for signed integers, except for "bit manipulations and modular arithmetic". This is said more explicitly: "Only use int and char of the builtin integer types". I think the consistency there is helpful to avoid the sign conversion errors you mentioned.

Updated version with tons of changes.

I've tried to get rid of the mesh data cache without hurting performance, and it's looking quite good.

The previous version runs a test (40 repetitions of the tangent building on my laptop) in 11.13sec, the new version does it without the mesh copy in 10.80sec.
Additionally, it parallelizes most operations, which brings the same test down to 5.05sec (on 4 cores).

For some more detail, the main changes are:

  • Parallelized tangent space accumulation, triangle tangent calculation, degenerate case handling and neighbor detection.
  • Optimized tangent space accumulation (removed redundant position fetches, used faster acosf)
  • Optimized neighbor detection (faster radixsort, explicit edge tracking, smaller sort key)
  • Optimized identical vertices detection: Instead of the hashing+sorting+block-processing approach, it's much simpler to reformulate it as a map with a custom hash and equality function. You just insert vertices, and if there's an entry already, you found a duplicate. For performance, I've tried many different implementations with and without support for parallelism (std::unordered_map, libcuckoo, junction, parallel-hashmap, tbb::concurrent_hash_map, robin_hood::unordered_flat_map), but by far the best performance was with Facebook's AtomicHashMap. So, I've created a super-simplified version of it (standalone, header-only, no find or erase, no resize etc.) and included it here. Turns out it doesn't even need to be a map, just a set, so it's basically just a fancy wrapper around a basic compare-exchange-based hash set now.
  • MANY further cleanups
  • Removed mesh data caching

Regarding uint: I still need to check how to do this cleanly - the problem with using signed integers is that every operator[] on a vector gives a sign error since they expect a unsigned size_t. Using size_t throughout the code would need more memory and make everything slower for no good reason, since the Blender-side mesh uses 32-bit integers anyways. So, I'd have to do explicit sign conversion for every vector access, which is not great.

Brecht Van Lommel (brecht) requested changes to this revision.Aug 23 2022, 8:42 PM

This looks generally fine now.

I don't think I need to review this in much detail if you've checked that it gives matching results on some more complex meshes.

intern/mikktspace/mikk_atomic_hash_set.hh
38 ↗(On Diff #54866)

This doesn't seem to be used.

intern/mikktspace/mikktspace.hh
17 ↗(On Diff #54866)

We try to keep Blender building even when TBB is not available, so this should do the same. See BLI_task.hh for how we do this in Blender.

This revision now requires changes to proceed.Aug 23 2022, 8:42 PM
  • Rebased
  • Added support for building without TBB (note: Cycles does not seem to pull in the WITH_TBB define automatically, I had to explciitly add it. Not sure why.)
  • Removed unused asm_volatile_pause
  • Ran tests comparing the output tangents to those produced by the old code, revealed two issues below. With these fixed, tangents match down to 1e-5 now.
  • Fixed missing atomic add in the group tangent accumulation, could cause wrong tangents at groups that were shared between vertices on different threads.
  • Fixed incorrectly using the serial specialization for duplicate detection on smaller meshes despite running in parallel, could cause missed neighbors and therefore wrong tangents.
intern/mikktspace/mikktspace.hh
220 ↗(On Diff #55297)

Should this be a duplicate of threading::parallel_for from BLI_task.hh? It seems that this has already been mentioned, but it should be added that this is not an example, but a unified interface. Whenever possible, you should use threading::parallel_for in C++ instead of recreating it. Also, not sure about tbb::this_task_arena::max_concurrency();, does it seem like it's already set somewhere?

intern/mikktspace/mikktspace.hh
220 ↗(On Diff #55297)

Oh, I didn't notice, this doesn't use BLI, sorry for the extra comment

Please go over my inlined comments.

P.S. Gave it a try on the buildbot: https://builder.blender.org/admin/#/builders/136/builds/16 and it compiled. So not sure why D15636 failed on the buildbot. Might be some difference ins code between 2 branches?

Looks fine to me now.

@Sergey Sharybin (sergey), your earlier inlined comments seem to be addressed?

@Brecht Van Lommel (brecht) Indeed. Not sure what made phabricator to show some older state of code and comments =\

This revision is now accepted and ready to land.Sep 5 2022, 2:59 PM

So happy to see MikkTspace get optimised!

Just wanted to note for the standalone Cycles this would be great to turn into a third_party library.

Cheers

So happy to see MikkTspace get optimised!

Just wanted to note for the standalone Cycles this would be great to turn into a third_party library.

Cheers

intern/mikktspace should already be standalone (the only non-stdlib dependency is TBB and that's optional), so this is mostly a build system question I guess.