
Attempt to address performance issues of the task scheduler with lots of very small tasks.
Abandoned · Public

Authored by Bastien Montagne (mont29) on Dec 21 2016, 9:15 PM.

Details

Summary

This is partially based on Sergey's work from D2421, but pushing things a bit further. Basically:

  • We keep a scheduler-level counter of TODO tasks, which lets workers avoid any locking (even the spinlock) when the queue is empty.
  • We spin/nanosleep a bit (less than a millisecond) when we cannot find a task, before going into a real condition-waiting sleep.
  • We keep a counter of condition-sleeping threads, and only send condition notifications when we know some threads are waiting on it.

In other words, when no tasks are available, we spend a bit of time in a rather high-activity but very cheap and totally lock-free loop, before going into the more expensive real condition-waiting sleep (see the sketch below).
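
A minimal sketch of that idle strategy, assuming POSIX threads and GCC's `__atomic` builtins as a stand-in for Blender's atomic_ops API; all names and constants here are hypothetical, not the actual task.c code:

```c
#include <pthread.h>
#include <time.h>

typedef struct Scheduler {
	unsigned int num_queued;    /* counter of TODO tasks, accessed atomically */
	unsigned int num_sleepers;  /* workers blocked on the condition, atomic */
	pthread_mutex_t mutex;
	pthread_cond_t cond;
} Scheduler;

/* Worker side: cheap lock-free spinning first, expensive condition sleep last. */
static void worker_wait_for_work(Scheduler *sched)
{
	/* Phase 1: a few hundred microseconds of lock-free polling with tiny naps. */
	for (int i = 0; i < 300; i++) {
		if (__atomic_load_n(&sched->num_queued, __ATOMIC_SEQ_CST) != 0) {
			return;  /* Work showed up, go grab it from the queue. */
		}
		struct timespec nap = {.tv_sec = 0, .tv_nsec = 1000};  /* 1 microsecond */
		nanosleep(&nap, NULL);
	}

	/* Phase 2: real condition-waiting sleep, much more expensive to enter and leave. */
	pthread_mutex_lock(&sched->mutex);
	__atomic_add_fetch(&sched->num_sleepers, 1, __ATOMIC_SEQ_CST);
	while (__atomic_load_n(&sched->num_queued, __ATOMIC_SEQ_CST) == 0) {
		pthread_cond_wait(&sched->cond, &sched->mutex);
	}
	__atomic_sub_fetch(&sched->num_sleepers, 1, __ATOMIC_SEQ_CST);
	pthread_mutex_unlock(&sched->mutex);
}

/* Producer side: only pay for a notification when someone is actually asleep. */
static void scheduler_task_pushed(Scheduler *sched)
{
	__atomic_add_fetch(&sched->num_queued, 1, __ATOMIC_SEQ_CST);
	if (__atomic_load_n(&sched->num_sleepers, __ATOMIC_SEQ_CST) != 0) {
		pthread_mutex_lock(&sched->mutex);
		pthread_cond_broadcast(&sched->cond);
		pthread_mutex_unlock(&sched->mutex);
	}
}
```

The seq_cst ordering on both counters is what keeps a pusher from missing a freshly-registered sleeper (and vice versa); real code would need a careful audit of that window.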

No noticeable speedup in a complex production scene (the barbershop one): here master, D2421 and this code give roughly the same performance (the new depsgraph being about 30% slower than the old one).

But with the test file from T50027 and the new depsgraph, after the initial bake, with master I have ~14fps, with D2421 ~14.5fps, and with this code ~19.5fps.

Note that in theory we could get rid of the condition completely and stay in the nanosleep loop, but this implies some rather high 'noise' (about 3% of CPU usage here with 8 cores), and going into the condition-waiting state after a few hundred microseconds does not give any measurable slowdown for me.

Also note that this code only works on POSIX systems (so no Windows; not sure how to do our nanosleeps on that OS :/ ).
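
For reference, the POSIX call in question, wrapped in a hypothetical helper (a Windows port would presumably have to fall back to something like Sleep(0) or SwitchToThread(), which have much coarser granularity):

```c
#include <time.h>

/* Sleep for a sub-millisecond amount of time; only available on POSIX. */
static void nap_microseconds(long usec)
{
	struct timespec ts;
	ts.tv_sec = 0;
	ts.tv_nsec = usec * 1000;  /* tv_nsec is in nanoseconds */
	nanosleep(&ts, NULL);      /* may wake early on a signal, which is fine here */
}
```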

Diff Detail

Repository
rB Blender
Branch
unlock_task_scheduler_II
Build Status
Buildable 364
Build 364: arc lint + arc unit

Event Timeline

Bastien Montagne (mont29) retitled this revision to Attempt to address performance issues of the task scheduler with lots of very small tasks.
Bastien Montagne (mont29) updated this object.
  • Cleanup, factorization, comments, and some fixes for potential issues.

There is indeed a nice speedup in the rigid body file with everything cached. However, what's most likely happening here is that it falls back to a single-thread update because the others are sleeping.

In the production file I've tested (14_03_G) I've got a 15% slowdown on my i7-6800K with 12 threads.

source/blender/blenlib/intern/task.c
121–131

Do we really need num_queued? Can we instead test queue.head != NULL?

126

Would use num_ prefix here.

368

Think we should get rid of the per-pool thread count limit and use the global scheduler's thread count instead.

Would require changes in some areas (different scenes can use different numbers of threads…). But maybe we can reconsider something there for the 2.8 project, while we are allowed to break some compatibility.

Perhaps not the highest priority, though.

386

Consider using restricted pointers. Even though it's an inlined function, compilers are not always smart enough to detect that there's no pointer aliasing.
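
A generic illustration of the suggestion (not the actual task.c function): with C99 restrict, the compiler may assume the two pointers never alias, so it can keep loads in registers and vectorize the loop cleanly.

```c
/* Without restrict, the compiler must assume dst[i] may overlap src[j]
 * and reload src on every iteration; with it, the loop vectorizes freely. */
static inline void accumulate(float *restrict dst,
                              const float *restrict src,
                              int n)
{
	for (int i = 0; i < n; i++) {
		dst[i] += src[i];
	}
}
```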

1007

Just ditch it.

Good code MUST NOT make decisions based on the current thread index, the number of tasks done, or anything else coming from scheduler-specific behavior.

Bastien Montagne (mont29) marked 2 inline comments as done. Dec 22 2016, 2:27 PM

15% slower is a lot, and I really cannot reproduce that here: with my own 8 threads this patch is maybe ~1% slower (barely measurable, since I get up to 10% variation in fps between different runs of the same patch, probably due to CPU heating and dynamic OC…), using 14_03_G. Maybe that is explained by the difference between a laptop and a workstation?

There is indeed a nice speedup in the rigid body file with everything cached. However, what's most likely happening here is that it falls back to a single-thread update because the others are sleeping.

I don’t think so? The task monitor shows eight cores at work for me at least, in a less 'coherent' way than with D2421 (with this patch they vary between 20% and 60%, while with D2421 they stay more steadily around 40%).

Anyway, will keep experimenting a bit more.

source/blender/blenlib/intern/task.c
121–131

Not sure… queue.head is protected by the spinlock, so I'm not sure reading it outside of the spinlock would be safe? On the other hand, we would mainly risk a false non-NULL result, I guess… Will give it a try (see the sketch below).
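
A sketch of the trade-off being discussed, with hypothetical names (a POSIX spinlock standing in for Blender's own spin lock): the unlocked read can be stale, but the worst case is a wasted lock attempt or a pop deferred to the next iteration of the worker loop, never corruption.

```c
#include <pthread.h>
#include <stddef.h>

typedef struct Task { struct Task *next; } Task;

typedef struct TaskQueue {
	Task *head;
	unsigned int num_queued;  /* atomic mirror of the queue length */
	pthread_spinlock_t lock;
} TaskQueue;

static Task *task_queue_try_pop(TaskQueue *queue)
{
	/* Fast path: the counter says empty, skip the spinlock entirely. */
	if (__atomic_load_n(&queue->num_queued, __ATOMIC_ACQUIRE) == 0) {
		return NULL;
	}
	Task *task = NULL;
	pthread_spin_lock(&queue->lock);
	if (queue->head != NULL) {  /* Re-check under the lock before popping. */
		task = queue->head;
		queue->head = task->next;
		__atomic_sub_fetch(&queue->num_queued, 1, __ATOMIC_RELEASE);
	}
	pthread_spin_unlock(&queue->lock);
	return task;
}
```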

368

I do not think it costs us much right now, so I would indeed leave it for later.

1007

Well… This is only used for info/stats purposes, where it saves callers from implementing their own counting of done work…

Thing is, cores being busy might come from them spinning instead of doing helpful tasks.

The reason why you might not experience such a slowdown might be that you don't have as many threading conflicts (threads need to wait less for the locked section to be released by another thread).

Thing is, cores being busy might come from them spinning instead of doing helpful tasks.

Here the eight cores spinning over an empty list generate ~3% of CPU usage. And this patch also boosts performance with the old depsgraph in the rigid body test file: in master I have ~20fps, and with this patch ~24fps…

In any case, I don’t think it is a bad thing to go mono-threaded here; the thing is, generating the nodes seems to take nearly as much time as computing them, so we cannot feed the threads quickly enough anyway, and there is no point in using them then.

The reason why you might not experience such a slowdown might be that you don't have as many threading conflicts (threads need to wait less for the locked section to be released by another thread).

Yeah, experimenting around that idea…

Bastien Montagne (mont29) edited edge metadata.
  • Cleanup, factorization, comments, and some fixes for potential issues.
  • Some minor changes from review.
  • Fix use-after-free concurrent issues.

OK, so… After a hard day of work, trying several ideas to improve this piece of crap even more, with no success, I am really starting to think we cannot ask much more from the task scheduler… Unless TBB can really do better?

EDIT: Had one or two more ideas last night, actually; will try them today…

The rigid body test file

I.e. the test file from T50027. For me, D2426 keeps giving far better performance in all cases, up to 25% speed gain compared to current master.

This is probably due to the fact that generating the tasks in this case takes nearly as much time as crunching them, so we can only feed one or two worker threads (kept alive thanks to the nanosleep busy loop, while the others spend most of their time waiting on the condition?).

The production file

I.e. the 14_03_G file. I still cannot reproduce a major slowdown between D2421 and D2426 here (tried with 4, the default [8] and 16 threads, after fixing a crappy bug in master, and got similar speeds in all cases…). In fact, master code seems to give the best results with that file, though it is a matter of one or two percent difference in all cases…

Furthermore, in all cases all 8 cores stay busy at around 80% here, and I do not see anything that could explain why D2426 would be worse than D2421 in a situation where all workers are easily fed with tasks.

Unless, with a beefy 12-core workstation, a single thread generating the tasks is not enough anymore compared to the 11 others crunching them, and starvation happens there as well. In fact, the only case I can think of where D2426 could be significantly slower is a 'barely starving' situation, where task generation is just a bit slower than task crunching.

  • Cleanup, factorization, comments, and some fixes for potential issues.
  • Some minor changes from review.
  • Fix use-after-free concurrent issues.
  • Never starve main thread (run_and_wait) from tasks!
  • Attempt to address nearly-starving cases.
  • Revert "Attempt to address nearly-starving cases."
  • Inline all task-pushing helpers.
  • Do less nanosleep loops.
  • Even more radical approach: lock-free, with a dedicated tasks manager thread...
  • More work, still not really working...
  • Some more fixes and adjustments, still failing...

undo last patch, stupid arcanist :(

So, since I am stubborn, I kept trying things this weekend, without much success. Here is the latest version of this patch; not much to add here, I still cannot reproduce the slowdown on 14_03_G, whatever I try…

I'm fed up with the scheduler for now, and tempted to conclude that it cannot really be efficient when fed with gazillions of small tasks, so maybe we should rather spend time on checking whether we can pack tasks better in the depsgraph? Unless TBB shines brighter (it has been coded by a bunch of much smarter guys, after all ;) ).

But IMHO small tasks will always be a PITA, locked scheduler or not; and a lock-free scheduler is at best very, very hard to get working (and, from a 'quick' test in D2435, does not even look tremendously more efficient)…

Update against latest master.

Updated against latest master...