
Task scheduler: Avoid threading overhead when there are lots of tasks to handle
Abandoned · Public

Authored by Bastien Montagne (mont29) on Dec 19 2016, 3:40 PM.

Details

Summary

The idea of this work is to avoid a mutex lock on every task acquisition and
replace it with a spin lock. The claim here: under normal conditions it will
almost always take no more than a single iteration of the loop to get a new
task to work on. That means we only spin very briefly, whereas sending the
thread to sleep would ruin performance.
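
A minimal sketch of that fast path, using plain pthreads and assumed types rather than the actual BLI_task code: the pop itself is guarded only by a cheap spin lock, and the caller decides what to do when the queue turns out to be empty.

```c
/* Illustrative sketch only, not the actual BLI_task implementation:
 * popping a task under a spin lock instead of a mutex, on the assumption
 * that the queue is almost never empty so the critical section stays tiny. */
#include <pthread.h>
#include <stddef.h>

typedef struct Task {
    struct Task *next;
    void (*run)(void *userdata);
    void *userdata;
} Task;

typedef struct TaskQueue {
    Task *head;
    pthread_spinlock_t lock;
} TaskQueue;

/* Returns NULL when the queue is empty; the caller decides whether to sleep. */
static Task *task_queue_pop(TaskQueue *queue)
{
    Task *task;
    pthread_spin_lock(&queue->lock);
    task = queue->head;
    if (task != NULL) {
        queue->head = task->next;
    }
    pthread_spin_unlock(&queue->lock);
    return task;
}
```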

This is a patch which attempts to implement that idea, and it gives a really
nice speedup in files like the one from T50027 (with all physics cached), for
both the old and the new depsgraph.

Unfortunately, there seems to be a mistake somewhere, so some production
scenes here crash from time to time.

There are also some TODOs in the code which I'd like to discuss and
brainstorm about.

Diff Detail

Repository
rB Blender
Branch
bli_task
Build Status
Buildable 354
Build 354: arc lint + arc unit

Event Timeline

Sergey Sharybin (sergey) retitled this revision from to Task scheduler: Avoid threading overhead when there are lots of tasks to handle.
Sergey Sharybin (sergey) updated this object.

Overall I like the new, less blocking design.

However, unless I’m mistaken, this means our notification system is not atomic anymore. We can have race conditions where some thread 'misses' a wakeup, since the predicate controlling thread sleeping (whether it finds a suitable task in the scheduler queue) is no longer protected by queue_mutex.

E.g. a task can be added to the pool & scheduler after a worker thread has checked for available tasks but before it starts waiting on the condition; the notification is then lost for that thread anyway, and it will remain sleeping until the next notification.
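
A sketch of that lost-wakeup scenario, using plain pthreads and a hypothetical queue_has_task() predicate rather than the actual task.c code:

```c
/* The predicate is checked outside queue_mutex, so a task pushed and
 * signalled between steps (1) and (2) is never noticed by this worker. */
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t queue_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t queue_cond = PTHREAD_COND_INITIALIZER;

extern bool queue_has_task(void);  /* hypothetical: is there work queued? */

static void worker_wait_for_work(void)
{
    if (queue_has_task()) {  /* (1) predicate checked without the mutex */
        return;
    }
    /* Another thread can push a task and notify the condition right here;
     * that notification is lost for this worker. */
    pthread_mutex_lock(&queue_mutex);              /* (2) */
    pthread_cond_wait(&queue_cond, &queue_mutex);  /* sleeps until the *next* notification */
    pthread_mutex_unlock(&queue_mutex);
}
```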

This also means you do not really need mutex locking around condition notifications anymore imho, since the 'predicate and condition' atomicity is no longer ensured anyway.

Since we also wait in the 'main' thread (in BLI_task_pool_work_and_wait()), there is in theory a possibility of ending up in a dead-lock situation where all workers and the main thread wait on the condition (hence none would be able to notify it). This is probably very, very unlikely to happen, even less so the more worker threads you add; not sure how likely it is with only two threads though (main + one worker)…

One possible solution would be to never wait on the condition in the 'main' thread (BLI_task_pool_work_and_wait()), and instead do a mere nanosleep there, e.g.?
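
A rough sketch of that suggestion, with hypothetical helper names rather than the actual BLI_task API: the main thread polls with a short nanosleep instead of waiting on the condition variable, so it can never take part in a dead-lock on that condition.

```c
#include <stdbool.h>
#include <time.h>

extern bool pool_has_pending_tasks(void);  /* hypothetical predicate */
extern bool pool_try_work_one_task(void);  /* hypothetical: run one task if available */

static void main_thread_work_and_wait(void)
{
    const struct timespec delay = {0, 100000};  /* 0.1 ms */

    while (pool_has_pending_tasks()) {
        if (!pool_try_work_one_task()) {
            /* Nothing to grab right now: sleep briefly instead of waiting
             * on the condition variable. */
            nanosleep(&delay, NULL);
        }
    }
}
```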

Also wondering if you tried a more radical, condition-free approach, where condition waiting would be replaced by nanosleep (maybe even with an adaptive delay, increasing when no task is found for a given number of iterations in a row, e.g.)? Something similar to an event-handling loop. Combined with an atomic flag or counter for checking whether the queue is empty, the time spent in task_scheduler_thread_wait_pop on each iteration when there is nothing in the queue would be very, very small (you would not even have to lock the spinlock then).
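
A rough sketch of such a condition-free worker loop with adaptive back-off, using C11 atomics and hypothetical helpers rather than the actual task.c code:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

extern atomic_int queue_size;        /* hypothetical atomic task counter */
extern atomic_bool scheduler_stop;   /* hypothetical shutdown flag */
extern bool try_pop_and_run(void);   /* hypothetical: pop one task under the spin lock */

static void worker_loop_adaptive(void)
{
    long delay_ns = 1000;  /* start at 1 us */

    while (!atomic_load(&scheduler_stop)) {
        /* Empty-queue check on an atomic counter: no spin lock needed. */
        if (atomic_load(&queue_size) == 0 || !try_pop_and_run()) {
            const struct timespec ts = {0, delay_ns};
            nanosleep(&ts, NULL);
            if (delay_ns < 1000000) {
                delay_ns *= 2;  /* back off, capped at 1 ms */
            }
        }
        else {
            delay_ns = 1000;  /* found work: reset the back-off */
        }
    }
}
```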

source/blender/blenlib/intern/task.c
236–240

Think we could get rid of the done counter at least… It’s only used for a stats report in one place afaict, and that can be achieved in other ways (like checking pool->num, not as accurate but it would work with high numbers of tasks, or just adding your own 'done' counter in your tasks' code if you really need it)?

About the limited number of threads, I am not sure… I wonder why we need this at all? It seems to only be used in one place, the OGL render?

243–249

Eeeeh… the comment is not quite accurate here? If I’m correct, this notification is *only* needed to ensure BLI_task_pool_work_and_wait() is not blocked? I.e. not for the worker threads (which shall remain sleeping in this case, in fact), but for the main 'calling' thread.

244

You could use the result from atomic_sub_and_fetch_z(&pool->num, done); here, that would avoid this piece of code potentially being called multiple times?
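
A minimal sketch of that pattern in C11 atomics (Blender's own atomic_sub_and_fetch_z returns the new value, so there the check would simply be == 0; the names here are approximated, not the actual task.c code):

```c
#include <stdatomic.h>
#include <stddef.h>

extern void notify_pool_done(void);  /* hypothetical "pool is empty" hook */

static atomic_size_t pool_num;  /* stands in for pool->num */

static void task_pool_num_decrease_sketch(size_t done)
{
    /* C11 atomic_fetch_sub returns the value *before* the subtraction, so
     * exactly one thread (the one performing the final decrement) sees it
     * equal to `done`, and the notification cannot run twice. */
    if (atomic_fetch_sub(&pool_num, done) == done) {
        notify_pool_done();
    }
}
```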

273–302

I would try the second idea.

Adding an atomic counter of waiting threads, increased just before locking and waiting on the condition, and decreased just after the condition returns and is unlocked, looks like a very good way to always know whether we have sleeping threads and hence whether we should signal the condition. This could also be used in task_pool_num_decrease() above.

I don’t think atomicity is an issue here anymore, see my comment about it in main message.
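
A sketch of that waiting-threads counter, with pthreads plus C11 atomics and assumed names rather than the actual task.c code:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

static pthread_mutex_t queue_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t queue_cond = PTHREAD_COND_INITIALIZER;
static atomic_int num_waiting_threads;

extern bool queue_has_task(void);  /* hypothetical predicate */

static void worker_sleep_until_work(void)
{
    /* Advertise that this thread is about to sleep. */
    atomic_fetch_add(&num_waiting_threads, 1);
    pthread_mutex_lock(&queue_mutex);
    while (!queue_has_task()) {
        pthread_cond_wait(&queue_cond, &queue_mutex);
    }
    pthread_mutex_unlock(&queue_mutex);
    atomic_fetch_sub(&num_waiting_threads, 1);
}

static void producer_notify(void)
{
    /* Only pay for the notification when somebody is (about to be) asleep. */
    if (atomic_load(&num_waiting_threads) > 0) {
        pthread_mutex_lock(&queue_mutex);
        pthread_cond_broadcast(&queue_cond);
        pthread_mutex_unlock(&queue_mutex);
    }
}
```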

690

You could as well put the queue_mutex lock/unlock inside the if imho? pool->num is not protected by that mutex, so it’s not preventing race conditions anyway...

Sergey Sharybin (sergey) edited edge metadata.

Implement the idea of traversing the queue a second time from inside a mutex lock

This way our notification should become more atomic.

Unfortunately, this halves the performance gain for me here.
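
A rough sketch of that "traverse the queue a second time under the mutex" pattern, with hypothetical helpers in place of the actual patch code; it assumes the producer notifies queue_cond while holding queue_mutex:

```c
#include <pthread.h>
#include <stddef.h>

typedef struct Task Task;

extern Task *task_queue_pop_spin(void);    /* hypothetical: pop under the spin lock */
extern Task *task_queue_pop_locked(void);  /* hypothetical: pop with queue_mutex held */

static pthread_mutex_t queue_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t queue_cond = PTHREAD_COND_INITIALIZER;

static Task *worker_wait_pop(void)
{
    /* Fast path: grab a task under the cheap spin lock only. */
    Task *task = task_queue_pop_spin();
    if (task != NULL) {
        return task;
    }

    /* Slow path: re-check the queue while holding queue_mutex before going
     * to sleep, so a task pushed (and signalled under the same mutex) in
     * between cannot have its notification lost. */
    pthread_mutex_lock(&queue_mutex);
    while ((task = task_queue_pop_locked()) == NULL) {
        pthread_cond_wait(&queue_cond, &queue_mutex);
    }
    pthread_mutex_unlock(&queue_mutex);
    return task;
}
```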

source/blender/blenlib/intern/task.c
243–249

On exit we join every worker thread, so we need to make sure they are not waiting on anything.

690

Yes, but if it's inside the mutex lock then it's possible for num to be decreased before we enter the mutex lock here, which will dead-lock us.

So I guess instead we need to guard pool->num here.

Tried doing nanosleep here, but that kills any speed benefits.

Found and fixed two crashers (race conditions leading to use-after-free issues). Feel free to commandeer back of course ;)

  • Fix two crashers/deadlockers (concurrent freeing leading to use-after-free bugs).