I suggest an implementation that is 1.5-2.1 times faster (depends on optimization options and arch), that reduces total rendering time by about 1%.
It gives a result that is on average 6° different from the old implementation. The difference is because normals (Ng, N, N') are not selected to be coplanar, but instead reflection R is lifted the least amount and the N' is computed as a bisector.
I composed a scene that demonstrates the difference in its extreme. Changed pixels have normals pointing away from camera. The situation itself is not physically correct and only comes from a possibility of ray coming from below the surface, so any reasonable solution for the normal will be more or less fine.
Old implementation vs new
I also composed a cpu-only test-bed. Note line 188.
Building and running the test:
g++ -O2 ensure_valid_reflection.cpp ./a.out

