Differential D262
Cuda use streams and async to avoid busywaiting
Authored by Martijn Berger (juicyfruit) on Jan 27 2014, 11:19 PM.
Details
This switches API usage for CUDA towards using more of the async calls. Updating only once every second is sufficiently cheap that I don't think it is worth doing it less often.
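To illustrate the general idea (a minimal sketch, not the code in this diff; cuPathTrace, d_buffer, host_buffer and render_tile_async are hypothetical names): kernel launches and the device-to-host copy are queued on an explicit stream, and the host only synchronizes roughly once per second to refresh the displayed tile, instead of issuing a blocking call after every sample.

#include <cuda.h>
#include <chrono>

/* hypothetical globals, assumed to be set up elsewhere */
extern CUfunction cuPathTrace;   /* path tracing kernel */
extern CUdeviceptr d_buffer;     /* render buffer on the device */
extern void *host_buffer;        /* host copy used for display updates */
extern size_t buffer_size;

static double seconds_now()
{
	using namespace std::chrono;
	return duration<double>(steady_clock::now().time_since_epoch()).count();
}

void render_tile_async(CUstream stream, int num_samples,
                       int xblocks, int yblocks, int threads)
{
	double last_update = seconds_now();

	for(int sample = 0; sample < num_samples; sample++) {
		void *args[] = { &d_buffer, &sample };

		/* queue one sample; this call returns without waiting for the GPU */
		cuLaunchKernel(cuPathTrace,
		               xblocks, yblocks, 1,
		               threads, threads, 1,
		               0, stream, args, NULL);

		/* roughly once per second, copy the buffer back and wait,
		 * so the UI can show progress */
		if(seconds_now() - last_update > 1.0) {
			cuMemcpyDtoHAsync(host_buffer, d_buffer, buffer_size, stream);
			cuStreamSynchronize(stream);
			last_update = seconds_now();
		}
	}

	/* final copy and sync so all queued work is finished */
	cuMemcpyDtoHAsync(host_buffer, d_buffer, buffer_size, stream);
	cuStreamSynchronize(stream);
}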
Event Timeline

Comment: I also tested with 2 and 3 cards, and in both cases Blender uses close to no CPU time during rendering. I still need to test on Windows, but I am confident about that.
Comment: Is render time affected or is performance the same? I would expect it to be ok, but it's good to check anyway.

Comment: Windows 7 x64, i7 quad core, GeForce 540M. Benchmarked bmw.blend at 128x tile size, vanilla vs. with patch.

Comment: It is not without a very minor performance impact. When setting the tile size very small and having 1000 tiles, I am measuring ~3 to ~5 ms of extra latency per tile. I think this is due to how the async calls are implemented. I think this is worth it, as a typical CUDA setup would use large tiles, and then this cost amounts to well under 1%.
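For a rough sense of scale (using the figures above and assuming ~4 ms per tile): 1000 small tiles add roughly 4 seconds of latency per render, while a setup with, say, 16 large tiles adds around 64 ms, which is indeed well under 1% of any render that takes more than a handful of seconds.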
Comment: Ok, that seems acceptable. Probably what we have to do to avoid this performance loss (and probably improve performance) is to queue multiple tiles to render on the same GPU, using multiple streams. Doug Gale and I made a patch for this some time ago but never finished it. Here's the old diff in case anyone wants to experiment with it; it won't apply cleanly to the current code but gives an idea.

Comment: I think we can also use the same stream for all operations, as right now it seems we have two streams: the implicit one for the blocking API and an explicit stream. I'll also look at Doug's work and take some things from there. From memory, multiple tiles per device also had other advantages.

Comment: Hi, testing with the patch gives small performance differences, depending on the scene (benchmarked "My Bench" and "The Tee", vanilla vs. patch, on openSUSE 13.1/64). CPU usage drops to 15-20% from 400%. Cheers, mib.

Comment: @Brecht Van Lommel (brecht), should I just merge this as-is, and is that allowed at the current BCon level?

Comment: Reopened as of rB6b1a4fc66e "Cycles CUDA: revert the f1aeb2ccf4 and 84f958754 busywait fixes for now."

Comment: I changed the code so that after 10 samples it calculates how many samples it can do in 1000 milliseconds and then forces a sync, evaluates again how far it is from the target value, and so on (sketched below). It should sync about once every second, but we might change that. For real scenes I get a minor speedup over master, and tiles still update, though they might update a bit less often.

Comment: Cleaned up the patch, as it was reverted for good reasons. Todo:
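A rough sketch of the adaptive sync interval described a couple of comments above (hypothetical names, not the code from the updated diff): sync after the first 10 samples, measure how long they took, and keep re-estimating how many samples fit in the ~1000 ms target so the stream ends up being synchronized about once per second.

#include <cuda.h>
#include <chrono>
#include <algorithm>

/* hypothetical helpers, assumed to exist elsewhere */
void launch_path_trace_sample(CUstream stream, int sample);  /* queues one sample */
void update_tile_display();                                  /* copies buffer back, refreshes UI */

static double seconds_now()
{
	using namespace std::chrono;
	return duration<double>(steady_clock::now().time_since_epoch()).count();
}

void path_trace_adaptive_sync(CUstream stream, int num_samples)
{
	const double target = 1.0;  /* aim for roughly one sync per second */
	int sync_every = 10;        /* initial guess: sync after 10 samples */
	int since_sync = 0;
	double last_sync = seconds_now();

	for(int sample = 0; sample < num_samples; sample++) {
		launch_path_trace_sample(stream, sample);
		since_sync++;

		if(since_sync >= sync_every) {
			cuStreamSynchronize(stream);

			/* estimate how many samples fit in the target interval and
			 * adjust, keeping the interval at least one sample */
			double elapsed = seconds_now() - last_sync;
			double samples_per_sec = since_sync / std::max(elapsed, 1e-6);
			sync_every = std::max(1, (int)(samples_per_sec * target));

			update_tile_display();
			last_sync = seconds_now();
			since_sync = 0;
		}
	}

	/* make sure the final samples are finished */
	cuStreamSynchronize(stream);
}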
Comment: I'm wondering if this is actually thread-safe. We are writing multiple samples to the same pixels in parallel, without any atomic operations. I would think that can fail, especially with small tiles? Or are these kernel launches already guaranteed to run one after the other on the GPU? Maybe we should add atomic operations for writing to passes in the kernel; I think we'll need to do this sooner or later anyway.
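To make the concern concrete, a small illustrative CUDA kernel (not the Cycles kernel; accumulate_pass, buffer and sample_result are made-up names): if two launches writing the same pixels can overlap on the GPU, a plain read-modify-write can lose samples, while atomicAdd cannot.

#include <cuda_runtime.h>

__global__ void accumulate_pass(float *buffer, const float *sample_result,
                                int pixel_count, bool use_atomics)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if(i >= pixel_count)
		return;

	if(use_atomics) {
		/* safe even if another launch adds to the same pixel concurrently */
		atomicAdd(&buffer[i], sample_result[i]);
	}
	else {
		/* racy if launches overlap: the load and the store are separate,
		 * so a concurrent update in between is silently lost */
		buffer[i] += sample_result[i];
	}
}

Whether the atomics are needed depends on whether launches on the same buffer can actually run concurrently, which is exactly the question raised in the comment above.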
Comment: Thinking this needs further updates: