(This is a duplicate of T48651, moved to the differential system by Thomas' suggestion)
This patch will allow CUDA devices to use system memory in addition to VRAM. While this is obviously slower than VRAM, I think it is still better than not rendering at all.
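A minimal sketch of the fallback idea, using the CUDA driver API that the Cycles CUDA backend is built on. The function name and error handling here are illustrative, not taken from the patch: try a VRAM allocation first, and on failure fall back to pinned host memory mapped into the device address space.

```cuda
#include <cuda.h>

/* Hypothetical helper (not the patch's actual code): allocate in VRAM,
 * falling back to mapped, pinned system memory when VRAM is exhausted. */
CUdeviceptr alloc_with_fallback(size_t size)
{
	CUdeviceptr dptr = 0;
	if (cuMemAlloc(&dptr, size) == CUDA_SUCCESS)
		return dptr;  /* normal case: the allocation lives in VRAM */

	/* VRAM is full: pin host memory and map it into the device's
	 * address space, so kernels can read it over the PCIe bus. */
	void *hptr = NULL;
	if (cuMemHostAlloc(&hptr, size, CU_MEMHOSTALLOC_DEVICEMAP) != CUDA_SUCCESS)
		return 0;  /* out of host memory as well */
	cuMemHostGetDevicePointer(&dptr, hptr, 0);
	return dptr;
}
```

Kernels can then use the returned pointer either way; reads from the host-backed range simply go across the bus, which is where the slowdown comes from.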
One related change rolled into this patch is that devices with compute >= 3.0 will now fetch kernel data through textures instead of global arrays again. This improves performance on Kepler cards, which don't use L1 caching on global loads, and the difference is even more apparent when the global data is in host memory instead of VRAM. Going through texture objects lets the kernel use L1 caching without running into the 4GB memory limit Cycles had when it was still using texture references on Kepler.
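For illustration, binding a linear kernel-data buffer to a texture object looks roughly like this (runtime API shown for brevity; the helper name is mine, not from the patch). Unlike the old per-reference texture bindings, texture objects are created at runtime from a plain device pointer:

```cuda
#include <cuda_runtime.h>

/* Illustrative sketch: wrap an existing linear buffer in a texture
 * object so kernel reads go through the texture path. */
cudaTextureObject_t bind_kernel_data(const float4 *dbuf, size_t num_elems)
{
	cudaResourceDesc res = {};
	res.resType = cudaResourceTypeLinear;
	res.res.linear.devPtr = (void *)dbuf;
	res.res.linear.desc = cudaCreateChannelDesc<float4>();
	res.res.linear.sizeInBytes = num_elems * sizeof(float4);

	cudaTextureDesc tex = {};
	tex.readMode = cudaReadModeElementType;

	cudaTextureObject_t obj = 0;
	cudaCreateTextureObject(&obj, &res, &tex, NULL);
	return obj;
}

/* Inside a kernel, element i is then fetched as:
 *     float4 v = tex1Dfetch<float4>(obj, i);
 */
```

The texture object is passed to the kernel as an ordinary argument, which is what removes the per-reference binding limits of the old approach.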
At this point, the patch is set to use no more than half of the system memory as rendering memory. Since system memory used for CUDA must be pinned, using too much of it can be bad for overall system performance. An obvious limitation here is that the 1/2 heuristic only works well with a single device; with multiple CUDA devices each trying to allocate that much memory, it could run into trouble. That still needs to be addressed, either through a better heuristic or a user parameter. I would also like to eventually extend it to share the pinned memory between GPUs where possible.
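The budgeting described above can be sketched as a small helper. This is a hypothetical illustration, not the patch's code: it caps pinned memory at half of system RAM, and shows one possible multi-device fix, namely splitting that budget evenly across devices instead of letting each device claim the full half.

```cpp
#include <cassert>
#include <cstddef>

/* Hypothetical sketch of the heuristic: never pin more than half of
 * system RAM, and divide that budget across the active CUDA devices
 * so several devices can't collectively over-allocate. */
static size_t pinned_memory_budget(size_t system_ram, int num_cuda_devices)
{
	size_t total_budget = system_ram / 2;  /* the 1/2 heuristic */
	if (num_cuda_devices < 1)
		num_cuda_devices = 1;
	return total_budget / num_cuda_devices;  /* per-device share */
}
```

A user parameter, as mentioned above, would simply replace the fixed `/ 2` with a configurable fraction.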