The work size is still quite conservative, and this change does not help progressive refine; for that we will need to render multiple tiles at the same time. It should already help when denoising renders run out of memory with big tiles, though, and it generally softens the performance drop-off with small tiles.
For the benchmark scenes on a GTX 1080, going from a 256x256 tile size down to 32x32 does not seem to lose any performance. Rendering with bigger tiles that cover the entire image is still a bit faster, though.
Note that the GTX 1080 has 2560 cores, and the heuristic results in a minimum work size of 25600 on that card (assuming there are sufficient AA samples). That corresponds to a tile size of 160x160 at 1 sample, which we know is not the fastest. We can get close to the bigger-tile performance by multiplying step_samples by 8, for example, but I'd like to have a better solution for driver timeouts before we automatically increase the work size that much.
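The arithmetic above can be sketched as follows. This is an illustrative model only, not the actual Cycles code: the function names and the per-core factor of 10 are assumptions inferred from the numbers in this description (2560 cores yielding a 25600 minimum work size).

```python
import math

def min_work_size(num_cores, factor=10):
    # Hypothetical heuristic: keep at least `factor` work items queued
    # per core so the GPU stays saturated. The factor 10 is inferred
    # from 2560 cores -> 25600 work size; it is not a confirmed constant.
    return num_cores * factor

def square_tile_side(work_size, samples=1):
    # Side length of a square tile whose pixel count, times the number
    # of samples rendered per step, covers the given work size.
    return math.isqrt(work_size // samples)

size = min_work_size(2560)        # GTX 1080: 2560 cores -> 25600
side = square_tile_side(size)     # 160 -> a 160x160 tile at 1 sample
step8 = square_tile_side(size, samples=8)  # smaller tile if 8 samples per step
print(size, side, step8)
```

This also shows why multiplying step_samples helps: rendering more samples per step reaches the same work size with a much smaller tile.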
This required a fair amount of refactoring; see the History tab to inspect individual commits. Overall, the code ended up simpler than before.