Page MenuHome

Fix T88598: Support disabling task isolation in task pool.
ClosedPublic

Authored by Jacques Lucke (JacquesLucke) on May 27 2021, 4:19 PM.

Details

Summary

See T88598 for more details on the problem this patch is trying to solve.

It's a bit unfortunate that task isolation was the default for all task groups for a while now, even in cases when it wasn't necessary. This makes it more risky for remove task isolation from the task groups that don't need it.
This patch implements the lazy solution of just allowing individual task groups to disable task isolation.

Diff Detail

Repository
rB Blender
Branch
support-disabling-task-isolation (branched from master)
Build Status
Buildable 14805
Build 14805: arc lint + arc unit

Event Timeline

Jacques Lucke (JacquesLucke) requested review of this revision.May 27 2021, 4:19 PM
Jacques Lucke (JacquesLucke) created this revision.

Task isolation was turned on as tasks could be run on the same thread that acquired the same lock. So perhaps we should add a note here that it should not be used together with thread synchronization code like mutex/signals.
About the naming BLI_task_pool_disable_task_isolation isn't ideal. Other options I can think of are to pass with the create methods or BLI_task_pool_set_isolation_level.
Personally I prefer to add it as a parameter of the create functions so it is explicit.

Determine if task isolation is used in create-function.

Generally, it feels a bit like a hack to have the task pool deal with task isolation imo. It feels like a workaround because we don't have a wrapper for tbb::this_task_arena::isolate that can be used from C yet.
Ideally, we would not task isolation by default and only add it to the places where it's really needed.

From what I can tell in this document you linked:
https://link.springer.com/chapter/10.1007/978-1-4842-4398-5_12

This would happen when tasks in a pool add more tasks to the same pool. Because then task_group.run() is called from inside the isolated region while task_group.wait() is outside. Is that the case where you ran into this issue?

If so I guess the option could be "allow recursively spawning tasks", and there could be a check in debug mode to verify to ensure that is followed. And it also means there's an easy way to check existing code using task pools code to see if it needs to disable isolation.

Yes and no. In my specific case the deadlock was indeed caused by the fact that I pushed tasks from within another task. But that is not the only way to get this deadlock. Here is a simplified snippet that deadlocks when Blender is started with --threads 1. In this example, the deadlock does not happen when more threads are used, because then another thread will execute the task from the inner isolated region.

#include <tbb/task_arena.h>
#include <tbb/task_group.h>

static void my_deadlock_test()
{
  tbb::this_task_arena::isolate([]() {
    tbb::task_group task_group;
    tbb::this_task_arena::isolate([&]() { task_group.run([] {}); });
    task_group.wait();
  });
}

Whenever this pattern happens on all threads at the same, we get a deadlock on all threads. Throwing in a couple of calls to tbb::this_task_arena::isolate when it is not necessary will make this pattern more likely.
I think calling the option allow_recursive_task_spawning hides the fact that there are more things to consider (which is unfortunate, but I'm not sure if we can really just hide this).

Just for fun, here is another way to deadlock Blender when started with --threads 1: P2142. Admittedly, this is not really different from the previous cases, but I found it interesting nevertheless.


Adding the debug check to see where isolation needs to be disabled is a good idea anyway, but it can't detect all potential deadlocks.

But that is not the only way to get this deadlock. Here is a simplified snippet that deadlocks when Blender is started with --threads 1. In this example, the deadlock does not happen when more threads are used, because then another thread will execute the task from the inner isolated region.

That abstraction may indeed be too leaky in practice. I'm fine calling it isolate_tasks then, but it needs a good description in BLI_task.h.

Just for fun, here is another way to deadlock Blender when started with --threads 1: P2142. Admittedly, this is not really different from the previous cases, but I found it interesting nevertheless.

This one I don't understand. Not sure how task_group.run() and task_group.wait() would end up being called from different isolation regions here?

Adding the debug check to see where isolation needs to be disabled is a good idea anyway, but it can't detect all potential deadlocks.

Right, it's not fool proof. But I guess it's simple enough to add code that checks that all task group operations are done in the same thread (using tbb::this_thread::get_id()) when isolate_tasks=true`.

This one I don't understand. Not sure how task_group.run() and task_group.wait() would end up being called from different isolation regions here?

Answering my own question. I guess the depsgraph recursively creates tasks, and the main thread being isolated restricts it from handling those.

Answering my own question. I guess the depsgraph recursively creates tasks, and the main thread being isolated restricts it from handling those.

Yes exactly.

  • Use enum instead of bool.
  • Improve comments.
  • Add heuristic to help avoid running into deadlocks caused by task isolation.
  • Disable task isolation for task pool used by depsgraph evaluation.
Hans Goudey (HooglyBoogly) added inline comments.
source/blender/blenlib/BLI_task.h
88

Extra s

91

Extra comma

  • Merge branch 'master' into support-disabling-task-isolation
  • fix typos
This revision is now accepted and ready to land.Jun 7 2021, 2:32 PM