Turns out that this is more code than I thought it was, so submitting it for review now before things get further along.
This patch refactors and cleans up the split kernel to make it more generic, so that it can be implemented for devices other than OpenCL. An implementation for CPU devices is included and has the same functionality as the OpenCL split kernel. This mainly allows for better debugging of the split kernel, but also has other benefits such as build-time compiler errors.
Main things changed:
- Use one large buffer for all data shared between the kernels instead of many buffers that need to be passed around everywhere
- All kernels now have the same signature, and all kernel logic has been moved from *.cl files into reusable .h files in kernel/split/
- New class DeviceSplitKernel that holds all host side logic for the split kernel
- Automatic tile splitting has been removed. The code was a bit messy and couldn't easily be made generic. Unfortunately, this means that if the tile size is set too large, the system can run out of memory (or even hang? need to investigate more)
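
To illustrate the first two points for reviewers, here is a minimal standalone sketch of the idea: one allocation carved into sub-arrays by offset, with every kernel sharing the same signature so the host can drive them from a table. This is not code from the patch. All names in it (SplitLayout, SplitContext, SplitKernelFn, kernel_init, kernel_attenuate, run_split_example) are hypothetical, and the 16-byte alignment is an assumption made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical layout: byte offsets of each sub-array inside one shared buffer.
struct SplitLayout {
  size_t ray_state_offset;   // one uint8_t per path
  size_t throughput_offset;  // one float per path
  size_t total_size;         // size of the single allocation in bytes
};

// Round `offset` up to a multiple of `alignment` (must be a power of two).
static size_t align_up(size_t offset, size_t alignment)
{
  assert((alignment & (alignment - 1)) == 0);
  return (offset + alignment - 1) & ~(alignment - 1);
}

// Compute where each sub-array lives for `num_paths` concurrent paths.
// The 16-byte alignment is an assumption for this sketch.
static SplitLayout make_layout(size_t num_paths)
{
  SplitLayout layout;
  size_t offset = 0;
  layout.ray_state_offset = offset;
  offset = align_up(offset + num_paths * sizeof(uint8_t), 16);
  layout.throughput_offset = offset;
  offset = align_up(offset + num_paths * sizeof(float), 16);
  layout.total_size = offset;
  return layout;
}

// Everything a kernel needs: the single buffer plus its layout.
struct SplitContext {
  uint8_t *buffer;
  SplitLayout layout;

  uint8_t *ray_state() const { return buffer + layout.ray_state_offset; }
  float *throughput() const
  {
    return reinterpret_cast<float *>(buffer + layout.throughput_offset);
  }
};

// Every kernel has the same signature, so the host can keep them in a table.
using SplitKernelFn = void (*)(const SplitContext &ctx, size_t path_index);

static void kernel_init(const SplitContext &ctx, size_t i)
{
  ctx.ray_state()[i] = 1;      // mark the path as active
  ctx.throughput()[i] = 1.0f;  // start with full throughput
}

static void kernel_attenuate(const SplitContext &ctx, size_t i)
{
  if (ctx.ray_state()[i]) {
    ctx.throughput()[i] *= 0.5f;  // stand-in for real shading work
  }
}

// Host-side driver: one allocation, then each kernel runs over all paths in order.
// Returns the sum of the per-path throughput values.
static float run_split_example(size_t num_paths)
{
  const SplitLayout layout = make_layout(num_paths);
  std::vector<uint8_t> storage(layout.total_size);
  const SplitContext ctx{storage.data(), layout};

  const SplitKernelFn kernels[] = {kernel_init, kernel_attenuate, kernel_attenuate};
  for (SplitKernelFn kernel : kernels) {
    for (size_t i = 0; i < num_paths; i++) {
      kernel(ctx, i);
    }
  }

  float sum = 0.0f;
  for (size_t i = 0; i < num_paths; i++) {
    sum += ctx.throughput()[i];
  }
  return sum;
}
```

Because the kernels only ever see the context, adding a new piece of shared state in a layout like this means adding one offset rather than threading another buffer argument through every kernel. It also shows why the total buffer size grows with the number of paths, which is the memory concern raised in the last point.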
I would appreciate any comments on style/naming; some of this could probably be arranged better.
I mentioned this in the commits, but I'll repeat it here: the split kernel on CPU is a bit slower than the mega kernel (~13% with BMW). This will be investigated later.
Please check the branch for further details.