With the new Cycles architecture, it will be easier to take advantage of SIMD instructions. This can be done in 3 steps.
[ ] Use Embree batched ray tracing
[ ] Use OSL batched shading
[ ] Use batched execution for the rest of the kernel
This requires some changes for CPU threads to render multiple paths in a batch. This is relatively easy with the new architecture, which already works like a state machine.
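As a rough illustration of what rendering multiple paths in a batch could look like on a CPU thread, here is a minimal sketch. All names (`Kernel`, `PathBatch`, `run_kernel`, `render_batch`, `BATCH_SIZE`) are hypothetical and are not the actual Cycles API. The shading step is a stand-in that only counts bounces. The point is only that each kernel is run over all paths in the batch that are queued for it, instead of following one path to completion.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical kernel identifiers; the real Cycles kernels differ.
enum class Kernel : uint8_t { IntersectClosest, ShadeSurface, Done };

// Assumed batch width, chosen arbitrarily for this sketch.
constexpr size_t BATCH_SIZE = 8;

// Per-batch state: which kernel each path is queued for, and its bounce count.
struct PathBatch {
  std::array<Kernel, BATCH_SIZE> queued;
  std::array<int, BATCH_SIZE> bounce;
};

// Run one kernel over every path in the batch that is queued for it.
// Returns the number of paths that were processed.
size_t run_kernel(PathBatch &batch, Kernel kernel, int max_bounce)
{
  size_t active = 0;
  for (size_t i = 0; i < BATCH_SIZE; i++) {
    if (batch.queued[i] != kernel) {
      continue;
    }
    active++;
    switch (kernel) {
      case Kernel::IntersectClosest:
        // Placeholder: a batched ray intersection would go here.
        batch.queued[i] = Kernel::ShadeSurface;
        break;
      case Kernel::ShadeSurface:
        // Placeholder: batched shading would go here.
        batch.bounce[i]++;
        batch.queued[i] = (batch.bounce[i] >= max_bounce) ? Kernel::Done :
                                                            Kernel::IntersectClosest;
        break;
      case Kernel::Done:
        break;
    }
  }
  return active;
}

// Advance the whole batch until no path has work left.
// Returns the number of intersect + shade passes that did work.
int render_batch(PathBatch &batch, int max_bounce)
{
  int passes = 0;
  for (;;) {
    size_t processed = run_kernel(batch, Kernel::IntersectClosest, max_bounce);
    processed += run_kernel(batch, Kernel::ShadeSurface, max_bounce);
    if (processed == 0) {
      break;
    }
    passes++;
  }
  return passes;
}
```

In a real implementation the loops inside `run_kernel` are where batched Embree queries or batched OSL shading would be called on the active paths.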
Some technical points:
* For full batched execution of the kernel, the integrator state likely needs a SoA (structure of arrays) memory layout for good performance. However, for Embree and OSL this is likely not needed yet.
* Full batched execution of the kernel could be implemented in two ways. Either we could use a dedicated compiler like ISPC for it, or we could use a template library like Enoki.
* The optimal batch size is unclear. It can range from 16 to millions, depending on whether it pays off to extract as much coherence as possible with sorting etc., or whether a small batch relying on incidental coherence is better due to lower overhead.
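To make the SoA point above concrete, here is a minimal sketch contrasting the two layouts. The struct and field names (`PathStateAoS`, `PathStateSoA`, `throughput`, `ray_t`, `bounce`) are made up for illustration and do not correspond to the actual integrator state. With SoA, each field is contiguous across paths, so a loop over one field maps onto SIMD loads and stores without gathers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// AoS: one struct per path. Reading `throughput` for consecutive paths
// strides over the other fields. Shown only for contrast.
struct PathStateAoS {
  float throughput;
  float ray_t;
  int bounce;
};

// SoA: one array per field, indexed by path. Hypothetical fields only.
struct PathStateSoA {
  std::vector<float> throughput;
  std::vector<float> ray_t;
  std::vector<int> bounce;

  explicit PathStateSoA(size_t num_paths)
      : throughput(num_paths, 1.0f), ray_t(num_paths, 0.0f), bounce(num_paths, 0)
  {
  }

  size_t size() const
  {
    return throughput.size();
  }
};

// Multiply the throughput of every path by a constant factor.
// The loop touches a single contiguous float array, which compilers can
// typically auto-vectorize; ISPC or a template library would make the
// vectorization explicit instead of relying on the optimizer.
void scale_throughput(PathStateSoA &state, float factor)
{
  float *throughput = state.throughput.data();
  const size_t num_paths = state.size();
  for (size_t i = 0; i < num_paths; i++) {
    throughput[i] *= factor;
  }
}
```

The batch size discussed above would correspond to `num_paths` here. Nothing in this sketch suggests which size is best.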