Page MenuHome

Task: Use TBB as Task Scheduler
ClosedPublic

Authored by Jeroen Bakker (jbakker) on Apr 20 2020, 2:19 PM.

Details

Summary

This patch enables TBB as the default task scheduler. TBB stands for Threading Building Blocks and is developed by Intel. The library contains several threading patters. This patch maps blenders BLI_task_* function to their counterpart. After this patch we can add more patterns. A promising one is TBB:graph that can be used for depsgraph, draw manager and compositor.

Performance changes depends on the actual hardware. It was tested on different hardwares from laptops to workstations and we didn't detected any downgrade of the performance.

  • Linux Xeon E5-2699 v4 got FPS boost from 12 to 17 using Spring's 04_010_A.anim.blend.
  • AMD Ryzen Threadripper 2990WX 32-Core Animation playback goes from 9.5-10.5 FPS to 13.0-14.0 FPS on Agent 327 , 10_03_B.anim.blend.

Future work

Diff Detail

Repository
rB Blender

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Jeroen Bakker (jbakker) requested review of this revision.Apr 20 2020, 2:19 PM
ogierm added a subscriber: ogierm.Apr 20 2020, 4:17 PM
Jeroen Bakker (jbakker) updated this revision to Diff 23907.EditedApr 20 2020, 4:31 PM
Jeroen Bakker (jbakker) edited the summary of this revision. (Show Details)
  • Task: Enable TBB for draw mesh extract
  • Task: Enabled TBB for projection painting
  • Task: Enable TBB for Depsgraph evaluation
  • Task: Enable TBB for iterators
  • Task: Remove legacy task scheduler/pool
  • Task: Remove tbb from function names
Jeroen Bakker (jbakker) edited the summary of this revision. (Show Details)Apr 20 2020, 4:35 PM
Jeroen Bakker (jbakker) edited the summary of this revision. (Show Details)Apr 20 2020, 4:38 PM
Jeroen Bakker (jbakker) edited the summary of this revision. (Show Details)Apr 20 2020, 4:41 PM
Jeroen Bakker (jbakker) edited the summary of this revision. (Show Details)Apr 20 2020, 4:46 PM
Jeroen Bakker (jbakker) edited the summary of this revision. (Show Details)Apr 20 2020, 4:57 PM
Jeroen Bakker (jbakker) edited the summary of this revision. (Show Details)
Jeroen Bakker (jbakker) edited the summary of this revision. (Show Details)Apr 20 2020, 5:07 PM

I was curious about what was crashing here, and found I couldn't reproduce it. Turns out that my build with Clang seems works fine, but GCC it's crashing.

For some strange reason it's failing to copy task parameters for suspended tasks. I tried storing the task parameter a plain struct without any methods, since I suspected placement new or move constructors could be problematic. But the same issue happens. Somehow the presence of the TBB task run code in the same function causes this to fail.

At this point I'm suspecting a compiler bug, though those are of course rare. This works for me, but I don't know why exactly.

--- a/source/blender/blenlib/intern/task_pool.cc
+++ b/source/blender/blenlib/intern/task_pool.cc
@@ -180,6 +180,13 @@ static void tbb_task_pool_run(TaskPool *pool, Task &&task)
     /* Suspended task that will be executed in work_and_wait(). */
     Task *task_mem = (Task *)BLI_mempool_alloc(pool->suspended_mempool);
     new (task_mem) Task(std::move(task));
+#ifdef __GNUC__
+    /* Work around apparent compiler bug where task is not properly copied
+     * to task_mem. This appears unrelated to the use of placement new or
+     * move semantics, happens even writing to a plain C struct. Rather the
+     * call into TBB seems to have some indirect effect. */
+    std::atomic_thread_fence(std::memory_order_release);
+#endif
   }
 #ifdef WITH_TBB
   else if (pool->use_threads) {
source/blender/draw/intern/draw_cache_extract_mesh.c
4690–4692

See rB05721cd00ad1: Mesh Batch Cache: Fix threading issue for why this must be a suspended task pool.

I'm not sure about the other places that use suspended pools. It should be slightly faster not to suspend, but it might not be safe.

Jeroen Bakker (jbakker) marked an inline comment as done.
Jeroen Bakker (jbakker) edited the summary of this revision. (Show Details)
  • Task: Compiler Issue Suspended Task Pool

Suspended pools were introduced to reduce huge threading overhead on machines with 16+ threads. On such configurations it was faster to avoid any threading overhead on every task push.
With TBB we should eliminate suspended pools, as one would expect TBB will have less overhead.

The idea of task_parallel_range_pool was to allow iteration over multiple ranges with maximum CPU utilization. Imagine you have 2 ranges of 24 elements to be covered on 16 threads machine. In sequential call of task_parallel_range you would have utilization like this:

  • All 16 threads are busy handling first 16 tasks of the first range
  • 8 threads are busy handling 8 remaining elements of the first range
  • All 16 threads are busy handling first 16 tasks of the second range
  • 8 threads are busy handling 8 remaining elements of the second range

If there is no dependency in handling the ranges you can imagine that more efficient would be to:

  • All 16 threads are busy handling first 16 tasks of the first range.
  • 8 threads are busy handling 8 remaining elements of first range, 8 threads are busy handling 8 first elements of the second range.
  • All 16 threads are busy handling remaining 16 elements of the second range.

The intent was to sue this for BKE_subdiv_foreach_subdiv_geometry (see TODO and NOTE in there).

For the performance use something like Vincent's rig. That indicates threading overhead in depsgraph the best. It's also something where more CPU cores you have more threading overhead you have with the old code. While on 4 core 8 threads SKL i7 it wasn't bad, on 22 core 44 threads BDW Xeon the threading overhead is very noticeable in the profiler.

Looking at the TBB documentation we could map task_parallel_range_pool to a custom Iteration Space (https://software.intel.com/en-us/node/506065). I will do some experiments.

@Jeroen Bakker (jbakker), i wouldn't spend too much time on this. It would be possible to implement one way or another. And since currently this codepath isn't in use and requires some changes in the code where it was intended to be used, can happen as part of Modifier/Subdiv optimization.

The patch is still stating it's WIP, withotu reviewers assigned. Will you give an explicit poke when you will consider patch ready for review?

Suspended pools were introduced to reduce huge threading overhead on machines with 16+ threads. On such configurations it was faster to avoid any threading overhead on every task push.

Is this about suspended pools or the local queue (which is also being removed in this patch)? I'm not sure how suspended pools would improve performance.

@Brecht Van Lommel (brecht), Actually, local queue is not used during suspension (arguably, it should from performance point of view, but then pool isn't really suspended). The actual speedup was coming from the change of code flow from

BLI_mutex_lock(...)
BLI_addtail(...)
BLI_condition_notify_one(...)
BLI_mutex_unlock(...)

// ...

BLI_mutex_lock(...)
BLI_addtail(...)
BLI_condition_notify_one(...)
BLI_mutex_unlock(...)

// ... as many times as you've got new tasks to be pushed ...

to

// Single lock block, but notify all.
BLI_mutex_lock(...)
BLI_movelisttolist(...)
BLI_condition_notify_all(...)
BLI_mutex_unlock(...)

Tasks: Removed TaskRangePool

We still need a similar mechanism, but for TBB.

Jeroen Bakker (jbakker) edited the summary of this revision. (Show Details)Apr 21 2020, 2:49 PM
Jeroen Bakker (jbakker) retitled this revision from [WIP] Task: Use TBB as Task Scheduler to Task: Use TBB as Task Scheduler.

Renamed BLI_task_pool_userdata to BLI_task_pool_user_data in master

  • CleanUp: Comments still referred to task_pool_legacy

CleanUp: Comments still referred to task_pool_legacy

Testing on AMD Ryzen Threadripper 2990WX 32-Core.

Agent 327 , 10_03_B.anim.blend. Animation playback goes from 9.5-10.5 FPS to 13.0-14.0 FPS, which is pretty significant.

I imagine that is where we'd see the most speedup, many cores and multiple animated characters.

source/blender/blenlib/intern/task_scheduler.cc
40–41

It seems this does not work, ./blender -t 2 does nothing here. Apparently this function is deprecated too.

This seems to work: P1345. I also changed it to only do anything if there is an override for the number of threads, and let TBB decide the number of threads itself.

Task: TBB Num Thread Override

Jeroen Bakker (jbakker) marked an inline comment as done.Apr 21 2020, 4:27 PM
Brecht Van Lommel (brecht) requested changes to this revision.Apr 21 2020, 4:33 PM

I tried a bunch of the affected areas and it seems stable to me, I've had no crashes or other issues.

source/blender/blenlib/intern/task_pool.cc
24–25

Add this to fix build without TBB:

#include <memory>
#include <utility>
This revision now requires changes to proceed.Apr 21 2020, 4:33 PM

Tasks: Compilation WITH_TBB=OFF

Note that some external libraries still depend on TBB and we should disable them when this option is turned off.

Jeroen Bakker (jbakker) marked an inline comment as done.Apr 21 2020, 4:42 PM
source/blender/blenlib/BLI_task.h
46–48

Remove line

source/blender/blenlib/BLI_threads.h
49–50

Remove line

tests/gtests/blenlib/BLI_task_test.cc
58

remove line

  • Removed unneeded lines

Tested on Linux Xeon E5-2699 v4 and got FPS boost from 12 to 17 using Spring's 04_010_A.anim.blend. And that's not that many CPU cores compared to Brecht. That's quite awesome speedup to see :)

I've also tested with simple scenes on my macOS laptop. Seemed all nice and stable as well.

Still need to test Windows. Maybe this is something where @Ray Molenkamp (LazyDodo) can help? WINK WINK

This revision is now accepted and ready to land.Apr 29 2020, 2:08 PM
Jeroen Bakker (jbakker) edited the summary of this revision. (Show Details)Apr 29 2020, 4:17 PM

Tested on Linux Xeon E5-2699 v4 and got FPS boost from 12 to 17 using Spring's 04_010_A.anim.blend. And that's not that many CPU cores compared to Brecht. That's quite awesome speedup to see :)

I've also tested with simple scenes on my macOS laptop. Seemed all nice and stable as well.

Still need to test Windows. Maybe this is something where @Ray Molenkamp (LazyDodo) can help? WINK WINK

I don't have the spring files, so no idea on the perf, however all unit tests seem to pass.

This revision was automatically updated to reflect the committed changes.