At the moment we don't use any special intrinsics (no KERNEL_AVX yet), but GCC auto-vectorization already gives a small speedup.
AVX is available on Intel Sandy Bridge and later, and AMD Bulldozer and later.
bmw.blend from the test suite renders 3 s faster with this (1:44 min -> 1:41 min). I could not test more advanced files due to current crashers in master.
@Sv. Lockal (lockal): Can we achieve more with this, by using special AVX intrinsics? Or are there more compiler flags we can utilise for this kernel?
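To make the question concrete, here is a rough sketch of the two options, not code from this patch. `scale_add` is a hypothetical helper, not a Cycles function. Built with something like `-O3 -mavx`, GCC can vectorize the plain scalar loop on its own; the `__AVX__` branch shows what a hand-written version with AVX intrinsics from `<immintrin.h>` might look like. Whether the intrinsics version actually beats the auto-vectorized loop would need to be measured.

```cpp
#include <cstddef>
#ifdef __AVX__
#  include <immintrin.h>
#endif

// Hypothetical helper (not part of Cycles): out[i] = a[i] * s + b[i].
void scale_add(const float *a, const float *b, float s, float *out, std::size_t n)
{
    std::size_t i = 0;
#ifdef __AVX__
    // Hand-written path: process 8 floats per iteration in 256-bit registers.
    const __m256 vs = _mm256_set1_ps(s);
    for (; i + 8 <= n; i += 8) {
        const __m256 va = _mm256_loadu_ps(a + i);  // unaligned loads, no alignment assumed
        const __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(_mm256_mul_ps(va, vs), vb));
    }
#endif
    // Scalar loop: handles the remaining elements, or the whole array when
    // AVX is not enabled. This is the form the compiler can auto-vectorize.
    for (; i < n; i++) {
        out[i] = a[i] * s + b[i];
    }
}
```

The `#ifdef __AVX__` guard keeps the file buildable without `-mavx`, which is roughly the situation a separate AVX kernel would have to handle.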
I know that an AVX2 kernel would be more interesting, since Haswell also brings FMA3, but Haswell is pretty new, and if we can improve performance on Sandy/Ivy Bridge, we should try. An AVX2 kernel can be added later too.
Anyway, I'm just dumping this here for testing and feedback; I don't see an urgent reason to include this yet.