The basic idea is to use a primitive visibility flag for this instead
of fetching the triangle index, shader and shader flags for each of
the intersections.
This seems to give about a 2% speedup on my laptop.
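
A minimal sketch of the idea, not the actual kernel code: every name
below (SceneData, SHADER_FLAG_EXAMPLE, VISIBILITY_FLAG_EXAMPLE and the
helper functions) is hypothetical. It assumes the property of interest
can be copied from the shader flags into a spare visibility bit once
per scene update, so that each intersection needs only a single lookup.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using std::uint32_t;

// Hypothetical flag bits, for illustration only.
constexpr uint32_t SHADER_FLAG_EXAMPLE = 1u << 3;
constexpr uint32_t VISIBILITY_FLAG_EXAMPLE = 1u << 31;

struct SceneData {
    std::vector<uint32_t> prim_shader;      // primitive index -> shader index
    std::vector<uint32_t> shader_flags;     // shader index -> shader flags
    std::vector<uint32_t> prim_visibility;  // primitive index -> visibility bits
};

// Slow path: two dependent lookups for every intersection.
inline bool has_property_via_shader(const SceneData &scene, std::size_t prim)
{
    const uint32_t shader = scene.prim_shader[prim];
    return (scene.shader_flags[shader] & SHADER_FLAG_EXAMPLE) != 0;
}

// Done once per scene update: mirror the shader property in a visibility bit.
inline void bake_visibility_flags(SceneData &scene)
{
    for (std::size_t prim = 0; prim < scene.prim_shader.size(); ++prim) {
        if (has_property_via_shader(scene, prim)) {
            scene.prim_visibility[prim] |= VISIBILITY_FLAG_EXAMPLE;
        }
        else {
            scene.prim_visibility[prim] &= ~VISIBILITY_FLAG_EXAMPLE;
        }
    }
}

// Fast path: one lookup and one bit test per intersection.
inline bool has_property_via_visibility(const SceneData &scene, std::size_t prim)
{
    return (scene.prim_visibility[prim] & VISIBILITY_FLAG_EXAMPLE) != 0;
}
```

The baking step leaves all other visibility bits untouched, so the
existing per-ray visibility test keeps working as before.
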
Just for the record -- we're now running out of free bits in the
primitive visibility flags, so we might want to split them up, i.e.
so that flags controlled from the interface don't share the same
bit field as the "internal" flags.
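
One possible shape for such a split, purely as a sketch: the struct and
all flag names below are hypothetical and do not reflect the existing
layout. The point is only that two separate words let both groups use
the full range of bit positions without colliding.

```cpp
#include <cstdint>

using std::uint32_t;

// Hypothetical bits set from the interface.
constexpr uint32_t USER_VISIBLE_CAMERA = 1u << 0;
constexpr uint32_t USER_VISIBLE_SHADOW = 1u << 1;

// Hypothetical internal bit; it reuses position 0 without any conflict.
constexpr uint32_t INTERNAL_FLAG_EXAMPLE = 1u << 0;

struct PrimitiveFlags {
    uint32_t user = 0;      // controlled from the interface
    uint32_t internal = 0;  // set by the renderer itself
};

// Ray visibility test, reading only the interface-controlled word.
inline bool visible_to(const PrimitiveFlags &flags, uint32_t ray_mask)
{
    return (flags.user & ray_mask) != 0;
}

// Internal property test, reading only the internal word.
inline bool has_internal(const PrimitiveFlags &flags, uint32_t bit)
{
    return (flags.internal & bit) != 0;
}
```

The cost is a second word per primitive, which would need to be
weighed against the memory and bandwidth budget.
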