This should further improve CUDA performance with small tiles if there
are sufficient AA samples, while explicitly preventing driver timeouts.
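
To illustrate the idea (this is not the code from the patch): with small tiles there are few pixels per kernel launch, so the work can be counted in (pixel, sample) items instead, and each launch can be capped to keep its run time bounded. The sketch below assumes a hypothetical `split_tile_work()` helper and a hypothetical `max_work_per_launch` budget; how the actual budget is chosen is not described here.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical description of one kernel launch: a contiguous range of
// samples rendered for every pixel of the tile.
struct LaunchRange {
  int start_sample;
  int num_samples;
};

// Split the samples of a tile into launches so that each launch covers at
// most `max_work_per_launch` (pixel, sample) work items. The budget is an
// assumed tuning parameter, not a value taken from the patch.
std::vector<LaunchRange> split_tile_work(int tile_w,
                                         int tile_h,
                                         int total_samples,
                                         int64_t max_work_per_launch)
{
  std::vector<LaunchRange> launches;
  const int64_t pixels = int64_t(tile_w) * tile_h;
  if (pixels <= 0 || total_samples <= 0) {
    return launches;
  }
  // Always render at least one sample per launch, even if the tile alone
  // already exceeds the budget.
  const int samples_per_launch =
      int(std::max<int64_t>(1, max_work_per_launch / pixels));
  for (int s = 0; s < total_samples; s += samples_per_launch) {
    launches.push_back({s, std::min(samples_per_launch, total_samples - s)});
  }
  return launches;
}
```

With a budget of 65536 work items, a 32x32 tile with 128 AA samples would be split into two launches of 64 samples each, while a 1920x1080 tile would fall back to one sample per launch.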
I tried to use the existing work-stealing code from get_next_work(),
but it was consistently 5-10% slower than hardware scheduling, even
when I tried to optimize it to avoid atomics.
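
For readers unfamiliar with the pattern, the general shape of atomic-counter work stealing looks roughly like the host-side sketch below. This is not the get_next_work() implementation; `WorkPool` and `claim_work()` are hypothetical names used only for illustration. Hardware scheduling, by contrast, derives each thread's work index directly from its block and thread indices, with no shared counter.

```cpp
#include <atomic>

// Hypothetical shared pool: workers claim chunks of work item indices by
// bumping a single shared counter.
struct WorkPool {
  std::atomic<int> next{0};  // index of the first unclaimed work item
  int total = 0;             // total number of work items
};

// Claim up to `chunk` items. Returns false once the pool is exhausted;
// otherwise writes the claimed half-open range [begin, end).
bool claim_work(WorkPool &pool, int chunk, int &begin, int &end)
{
  // One atomic add per claim; this shared counter is the contention point.
  begin = pool.next.fetch_add(chunk, std::memory_order_relaxed);
  if (begin >= pool.total) {
    return false;
  }
  end = (begin + chunk < pool.total) ? begin + chunk : pool.total;
  return true;
}
```

Larger chunks mean fewer atomic operations but coarser load balancing, which is the usual trade-off with this approach.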
Only tested on a GTX 1080 on Linux so far. With this change, render
times for the benchmark scenes at tile sizes ranging from the full
render down to 32x32 are within a few percent of each other.
It would not surprise me if there are problems with other cards and
platforms though; that needs more testing.
Depends on D2856