
Cycles: Add an AVX kernel
Closed, Public

Authored by Thomas Dinges (dingto) on Jan 15 2014, 12:51 AM.

Details

Summary

At the moment we don't use any special intrinsics (no KERNEL_AVX yet), but GCC's auto-vectorization gives a small speedup.
AVX is available on Intel Sandy Bridge and later, and AMD Bulldozer and later.

bmw.blend from the test suite renders 3s faster with this (1:44 min -> 1:41 min). I could not test more advanced files because of crashes currently present in master.

@Sv. Lockal (lockal): Can we achieve more with this by using special AVX intrinsics? Or are there more compiler flags we can utilise for this kernel?

I know that an AVX2 kernel would be more interesting, due to FMA3, but Haswell is pretty new and if we can improve performance on Sandy/Ivy Bridge, we should try. An AVX2 kernel can be added later too.

Anyway, I'm just dumping this here for testing and feedback; I don't see an urgent reason to include this yet.

Event Timeline

Indeed, GCC has some basic autovectorization for AVX, but personally I have never seen any example of float[8] code in Blender that GCC is able to autovectorize. One immediate change Cycles gets with an AVX1 kernel is that __m128 _mm_set1_ps(float x) compiles to a single instruction, VBROADCASTSS, instead of two (VMOVSS + VPSHUFD), though probably not in Clang.
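For reference, here is a minimal sketch of the kind of loop GCC's vectorizer can usually handle. `scale_add` is a hypothetical helper written for this comment, not code from Cycles. Built with something like `g++ -O3 -mavx`, GCC may process eight floats per iteration and broadcast the two scalars into vector registers; whether it actually does depends on the compiler version and flags.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not from the Cycles source: computes
// out[i] = in[i] * scale + offset for n floats.
// The loop has no dependency between iterations and the pointers are
// marked as non-aliasing, which is the shape auto-vectorizers look for.
void scale_add(float *__restrict out, const float *__restrict in,
               float scale, float offset, std::size_t n)
{
    assert(out != nullptr && in != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = in[i] * scale + offset;
    }
}
```

Most of the Cycles kernel is not shaped like this (it works on individual rays and small float3/float4 values), which would explain why the gain from auto-vectorization alone is small.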

I think we should not do an AVX1 kernel and just do a Haswell kernel but keep it disabled for the time being.
AVX1 does not seem well suited and complete enough as an instruction set by itself to warrant the effort.

AVX2 + FMA combined with the more mature 256 bit vector support seems to be more fitting as the next level we should support.

Keep in mind that I am not a SIMD/Intel expert by a long shot, but I think we have more to gain by either making an AVX2 kernel or setting things up so that the user can build a native or custom kernel as the level beyond SSE41 for the time being.

3% faster render is not bad at all, I'd consider enabling this by default if it gives that kind of speedup in general.

intern/cycles/kernel/kernel_avx.cpp, lines 26–32

After my last commit these lines have to be moved above the util_optimization.h include.

Thomas Dinges (dingto) updated this revision on Jan 15 2014, 7:57 PM.

Updated for your latest changes, @Brecht Van Lommel (brecht).

I rendered the Caminandes test file with 3b5fa7b and also got a nice speed boost here.

SSE41 kernel: 48:14 min
AVX kernel: 46:53 min

That is 81s per frame, or ~3%. For 1s of animation (24 frames) that would be 81s * 24 frames = 1944s, about 32 min saved.
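The arithmetic behind those numbers can be checked with a small sketch. The helper names are made up for this comment, and it assumes every frame saves the same amount of time as the measured one.

```cpp
#include <cassert>

// Converts a "minutes:seconds" render time into whole seconds.
int to_seconds(int minutes, int seconds)
{
    assert(minutes >= 0 && seconds >= 0 && seconds < 60);
    return minutes * 60 + seconds;
}

// Seconds saved per frame when going from the slower to the faster kernel.
int saved_per_frame(int slow_s, int fast_s)
{
    return slow_s - fast_s;
}

// Saving as a percentage of the slower render time.
double saved_percent(int slow_s, int fast_s)
{
    assert(slow_s > 0);
    return 100.0 * (slow_s - fast_s) / slow_s;
}

// Minutes saved over `frames` frames, assuming a constant per-frame saving.
double saved_minutes(int slow_s, int fast_s, int frames)
{
    return saved_per_frame(slow_s, fast_s) * frames / 60.0;
}
```

With 48:14 (2894s) for SSE41 and 46:53 (2813s) for AVX, this gives 81s per frame, roughly 2.8%, and 32.4 minutes over 24 frames.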

I think the biggest speedup happens in scenes where we use SSE intrinsics already (Images, Noise textures...), basically what @Sv. Lockal (lockal) said above about the instructions.

I tested some other scenes and none were slower; some are 1% faster, others are almost the same. So I think we can justify including this?

Ah, one more thing: this is disabled on Windows atm, as MSVC 2008 does not have AVX intrinsics. When we switch to VC2012/2013 we can use /arch:AVX.
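A compile-time guard for this could look roughly like the sketch below. `SKETCH_HAVE_AVX_KERNEL` and `avx_kernel_compiled` are hypothetical names for illustration, not what the patch or the build system actually defines. It relies on `_MSC_VER` being below 1700 for compilers older than VC2012, and on `__AVX__` being predefined when GCC/Clang get `-mavx` or MSVC gets `/arch:AVX`.

```cpp
// Hypothetical macro for this sketch; the real define in the patch may differ.
#if defined(_MSC_VER) && _MSC_VER < 1700
// Visual C++ older than 2012: leave the AVX kernel out, as described above.
#  define SKETCH_HAVE_AVX_KERNEL 0
#elif defined(__AVX__)
// This translation unit is being compiled with AVX code generation enabled.
#  define SKETCH_HAVE_AVX_KERNEL 1
#else
#  define SKETCH_HAVE_AVX_KERNEL 0
#endif

// Reports whether this translation unit was built as an AVX kernel.
bool avx_kernel_compiled()
{
    return SKETCH_HAVE_AVX_KERNEL != 0;
}
```

This only says how the file was compiled. Whether the CPU actually supports AVX still has to be checked at run time before calling into such a kernel.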

I think it's fine to commit this.