Page MenuHome

Cycles: Make spatial split BVH multi-threaded
ClosedPublic

Authored by Sergey Sharybin (sergey) on Feb 22 2016, 6:20 PM.

Details

Summary

The title actually covers it all, This commit exploits all the work
being done in previous changes to make it possible to build spatial
splits in threads.

Works quite nicely, but has a downside of some extra memory usage.
This extra memory usage is partially solvable, but partially it's
something we'll have to simply accept.

Things to take care about before it could be considered final and
merged:

  • Thread tasks do have a coy of BVHRange now, which makes it so builder uses more memory, even comparing with the regular BVH builder.

    Idea here would be to limit amount of tasks which do exist in the scheduling queue. Implementing this is quite tricky tho.
  • Leaf creating uses quite bad policy of reserving memory now. Need to think of something smarter to approximate overall count of leaves more accurate.

    We really should minimize amount of re-allocations here, so being not really accurate on approximation has performance benefits here.
  • Progress update is done differently from the regular BVH builder so far. We can make it to match tho.

Diff Detail

Repository
rB Blender

Event Timeline

Here's some side-by-side comparison of timing before and after the changes

1 master patch
2
3## BMW
4 Build time: 0.00105596 Build time: 0.00111604
5 Build time: 0.0116301 Build time: 0.013335
6 Build time: 0.0520229 Build time: 0.052249
7 Build time: 0.25269 Build time: 0.18668
8 Build time: 1.83551 Build time: 1.01374
9 Build time: 2.45955 Build time: 1.02784
10 Build time: 0.204987 Build time: 0.179392
11
12## Agent
13 Build time: 2.3522 Build time: 0.482608
14 Build time: 92.2917 Build time: 4.39739
15 Build time: 0.874709 Build time: 0.771483

It's really specious agent scene is like 20x faster on 8 threads CPU -- it's almost same exact number of leaf nodes (threaded generated 16 more leaves, why so is to be investigated). But my theory is that it's because of much lower number of re-allocations. Comparing heat maps of BVH traversal didn't show measurable differences in threaded and non-threaded built BVH.

No time to review today, but great to see this tackled.

For those timings, am I understanding correctly that for BMW scene there are multiple smaller BVHs being built in parallel, so it already benefitted from multithreading, but it's even faster now. While the agent scene was pretty much single threaded due to the single big BVH, and so the speedup is much bigger?

@Brecht Van Lommel (brecht), yes. But having more than 8 times speedup on a CPU with 4 cores 8 threads is still suspicious. I _think_ it's due to re-allocation policy which now over-allocates quite a bit (up to 2 times) but reduces number of re-allocations. Other changes (sorting indices, stack allocator) might helping here as well and cause non-linearity in the speed gain comparing to master.

intern/cycles/util/util_task.h
128–129 ↗(On Diff #6162)

At some point early on I used thread local storage in Cycles in various places in the kernel, but removed it because it turned out to have a performance impact on some platforms (at least OS X if I remember correctly).

It might have been the specific implementation, or maybe compilers have improved, but I would at least add a warning to use this with care, even if it's nice not having to pass around the thread id as a parameter everywhere.

The tls code was removed entirely in rB7c0a0bae79bb: Fix #33375: OSL geom:trianglevertices gave wrong coordinates for static BVH..

@Brecht Van Lommel (brecht), OSX again :S But that's interesting, because OSL/OIIO are using TLS quite a lot. That would be quite tricky to have a squeeze how exactly they used it, since it's all wrapped by Boost. Alternative could be to let thread callback to have optional int thread_id argument, but that'd cause some chain-reaction changes :S

Will run tests here on iTrash and see how it all works. However, from a quick squeeze into your patch, you used pthread's local storage which is likely same buggy still. This patch uses __thread attribute which as far as i concerned should work for Clang and Gcc. For MSVC we might wannna to tweak it to __declspec(thread). So maybe it's not all lost.. :)

OSL/OIIO only use TLS if you do not pass in per thread info yourself, so that the API is simpler if you don't need maximum performance. But if you do, you need to manually handle per thread info and not rely on TLS. So I think they found similar performance issues, though theoretically I don't think TLS has to be slow and maybe the implementations have improved now.

@Brecht Van Lommel (brecht), it's a bit more complicated in OSL actually. It will use TLS for a runtime optimization. In fact, this caused issues for us (TLS was not freeing properly before ccl::Session exists, causing access to a freed memory). That was a reason why we switched to ahead-of-a-time greedyjit optimization now.

Right, I was referring to operations like shader evaluations and texture lookups that you do very often.

Besides the further work you already mention and the TODO comments in the code, I'm having difficulty finding anything to comment on. What's there now looks correct to me.

intern/cycles/bvh/bvh_build.cpp
590–603

This can be be const int MAX_ITEMS_PER_LEAF = 16;

This revision was automatically updated to reflect the committed changes.