While this works reasonably well (it can even be marginally faster than
the previous, thread-locking iteration code), it is not really
satisfying in the end: the speed improvements are rather disappointing.
I think the core of the issue is that this code typically only does
actual work for a few percent of the whole iterated range (best
performance is typically achieved with the dynamic scheduler and a
chunk size of 1!).
This is not really suited to the BLI_task_parallel_range() code. That
API simply introduces too much overhead in its looping/control mechanism
when most iterations are just skipped with a "continue" after a very
quick test.
Going to a lock-less reshuffle of the existing code, on the other hand,
gives a speedup of more than 5%, so...
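
For illustration only, here is a minimal sketch of what lock-less
iteration over a mostly-skipped range can look like. It is not the
actual reshuffled code: RangeCursor, range_cursor_claim() and
worker_count_handled() are hypothetical names, and C11 atomics are
assumed in place of whatever atomic helpers the real code uses. Each
worker claims a chunk of indices with a single atomic fetch-and-add
instead of taking a lock, then runs the cheap test and skips most items.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical shared state: the next unclaimed index in [0, total). */
typedef struct RangeCursor {
  atomic_size_t next;
  size_t total;
} RangeCursor;

static void range_cursor_init(RangeCursor *cursor, size_t total)
{
  atomic_init(&cursor->next, 0);
  cursor->total = total;
}

/* Claim up to chunk_size indices without taking a lock.
 * Returns false once the range is exhausted. The counter keeps growing
 * after exhaustion; this sketch assumes it never gets close to SIZE_MAX. */
static bool range_cursor_claim(RangeCursor *cursor,
                               size_t chunk_size,
                               size_t *r_start,
                               size_t *r_end)
{
  assert(chunk_size > 0);
  const size_t start = atomic_fetch_add(&cursor->next, chunk_size);
  if (start >= cursor->total) {
    return false;
  }
  const size_t end = start + chunk_size;
  *r_start = start;
  *r_end = (end < cursor->total) ? end : cursor->total;
  return true;
}

/* Worker body: most items fail the quick test and are skipped, so the
 * only shared-state cost per chunk is one atomic add.
 * Returns how many items this worker actually handled. */
static size_t worker_count_handled(RangeCursor *cursor,
                                   const unsigned char *needs_work,
                                   size_t chunk_size)
{
  size_t handled = 0;
  size_t start, end;
  while (range_cursor_claim(cursor, chunk_size, &start, &end)) {
    for (size_t i = start; i < end; i++) {
      if (!needs_work[i]) {
        continue; /* Very quick test, nothing to do for this item. */
      }
      handled++; /* Placeholder for the real per-item work. */
    }
  }
  return handled;
}
```

Several threads can call worker_count_handled() on the same cursor; the
chunk size trades atomic traffic against load balancing.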