This is an initial implementation which seems to give better
device utilization here when using two non-matched GPUs, as
well as when using multi-GPU together with CPU.
The general idea is to balance the amount of work based on the
observed performance of the devices, and to "re-slice" the big
tile accordingly.
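
As a rough illustration of the idea only (not the code in this change), the sketch below assumes each device gets a contiguous band of rows of the big tile, and that new weights are simply made proportional to the measured throughput. The names `rebalance` and `slice_rows` are hypothetical, and the actual objective function and weight update may differ.

```python
def rebalance(weights, times):
    """Return new work fractions proportional to observed throughput.

    weights: fraction of the big tile each device rendered last time.
    times:   seconds each device took for its share (must be > 0).
    Assumption: throughput is estimated as work fraction / time.
    """
    throughput = [w / t for w, t in zip(weights, times)]
    total = sum(throughput)
    return [p / total for p in throughput]


def slice_rows(height, weights):
    """Split `height` rows into contiguous (start, count) slices, one per device."""
    slices = []
    start = 0
    cumulative = 0.0
    for i, w in enumerate(weights):
        cumulative += w
        # The last device takes whatever is left, so rounding never drops rows.
        if i == len(weights) - 1:
            end = height
        else:
            end = round(height * cumulative)
        end = max(end, start)
        slices.append((start, end - start))
        start = end
    return slices


# Two devices started with an equal split; the second one was 3x slower.
new_weights = rebalance([0.5, 0.5], [1.0, 3.0])
print(slice_rows(1080, new_weights))
```

In this sketch the faster device ends up with the larger band of rows on the next iteration. A real implementation would likely damp the weight change between iterations to avoid oscillating on noisy timings.
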
Things which are known to be not final, but are considered
further development:
- The balancing algorithm might need some tweaks to the objective function and the weight modification, so that it converges to the ideal balance quicker.
- The "re-slicing" might also be optimized memory-wise.
- Headless rendering needs to give a few iterations of smaller work to allow the multi-device setup to settle into balance.
The balancing logic is in its own little file, which simplifies
experimentation.