There are a couple of aspects here.

The first idea is to exploit the fact that for scenes without volumes
there is no need to traverse intersections in z-sorted order. This is
achieved by a special BVH traversal function that takes a callback,
which is called on every intersection. From within this callback we
can modify the throughput or abort traversal if we hit something
opaque.
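
A minimal C++ sketch of that callback idea, not the actual Cycles code.
Hit, traverse_all and shadow_throughput are hypothetical names, and the
"BVH" is just a list of hits in arbitrary order. It relies on the
surface-only throughput being a plain product, so the visiting order
does not matter:

```cpp
#include <functional>
#include <vector>

// Hypothetical stand-in for a ray/primitive intersection record.
struct Hit {
    float t;        // distance along the shadow ray
    float opacity;  // 0 = fully transparent, 1 = fully opaque
};

// Hypothetical stand-in for the BVH traversal: visits every hit along
// the ray in arbitrary (not depth-sorted) order and calls on_hit for
// each one. on_hit returns false to abort the traversal early.
// Returns the number of hits that were visited.
inline int traverse_all(const std::vector<Hit> &hits_in_bvh_order,
                        const std::function<bool(const Hit &)> &on_hit)
{
    int visited = 0;
    for (const Hit &hit : hits_in_bvh_order) {
        ++visited;
        if (!on_hit(hit)) {
            break;  // callback asked to stop (opaque blocker found)
        }
    }
    return visited;
}

// Shadow throughput for a scene without volumes. Each transparent hit
// multiplies the throughput, so no sorting by distance is needed.
inline float shadow_throughput(const std::vector<Hit> &hits,
                               int *visited = nullptr)
{
    float throughput = 1.0f;
    const int n = traverse_all(hits, [&](const Hit &hit) {
        if (hit.opacity >= 1.0f) {
            throughput = 0.0f;
            return false;  // fully blocked, abort traversal
        }
        throughput *= 1.0f - hit.opacity;
        return true;  // keep traversing
    });
    if (visited != nullptr) {
        *visited = n;
    }
    return throughput;
}
```

With two half-transparent hits the result is 0.25 regardless of the
order in which they are visited, and an opaque hit stops the traversal
as soon as it is reached.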

In tests with koro_blend.blend this gave roughly a 2.4x speedup (1:32
with the RC1 build and 0:38 with the patched master). Be careful when
comparing results: the GPU version used to give some wrong results
because of missing intersections (as far as I can see, this is caused
by moving the ray too far forward after the bounce, which moves it out
of Koro's fur).

This speedup only works for scenes without volumes. It also requires a
bit more VRAM: since it is one extra BVH traversal, it needs one more
stack, which is not de-duplicated by the CUDA compiler. On my current
GTX1080 this is about 48 megabytes.

The second idea is to optimize transparent shadows on GPU for scenes
with volumes and few transparent bounces.
For scenes with few transparent shadow hits it is noticeably faster to
sort an array of intersections than to do a full BVH intersection
query for each intersection step. Unfortunately, this costs some more
memory; the penalty is now about 20 megabytes (weirdly, that is half
of the previous bump, so maybe some of the static arrays were
de-duplicated this time).

With this change koro.blend renders in the same time even after adding
a volume to the scene.
Currently this only works for up to 16 transparent bounces.
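
A rough C++ sketch of the record-and-sort approach, again not the
actual Cycles code. ShadowHit, kMaxRecordedHits and record_and_sort are
hypothetical names, and returning false when the limit is exceeded is
an assumption about how a caller could detect that it needs another
code path:

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

// Limit taken from the text above: at most 16 recorded hits.
constexpr std::size_t kMaxRecordedHits = 16;

// Hypothetical stand-in for a shadow ray intersection record.
struct ShadowHit {
    float t;        // distance along the shadow ray
    float opacity;  // 0 = fully transparent, 1 = fully opaque
};

// Copies the hits found by a single traversal into a fixed-size array
// and sorts them front to back, so that volume segments between hits
// can be stepped through in order without further BVH queries.
// Returns false (assumption) if there are more hits than the array can
// hold; out and count are left untouched in that case.
inline bool record_and_sort(const std::vector<ShadowHit> &unordered,
                            std::array<ShadowHit, kMaxRecordedHits> &out,
                            std::size_t &count)
{
    if (unordered.size() > kMaxRecordedHits) {
        return false;
    }
    count = unordered.size();
    std::copy(unordered.begin(), unordered.end(), out.begin());
    std::sort(out.begin(),
              out.begin() + static_cast<std::ptrdiff_t>(count),
              [](const ShadowHit &a, const ShadowHit &b) {
                  return a.t < b.t;
              });
    return true;
}
```

Sorting at most 16 entries is cheap, which is why this can beat one
full BVH query per transparent bounce when the hit count is small.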



