
Cycles: Experiment with removing closures storage for certain evaluation
Closed, Public

Authored by Brecht Van Lommel (brecht) on Sep 21 2016, 3:46 PM.

Details

Summary

The idea is to avoid storing the whole array of closures when we only need
it to sum certain closure types later. Instead we sum the closure weight at
the point where we are about to add the closure to storage, without actually
adding it to the storage.

It's a simple idea that is a bit tricky to implement, because we don't want
to introduce a whole new bunch of API functions, especially ones for
SVM node evaluation.

The current idea is to tag ShaderData with a special flag saying that we want
shader evaluation to sum a certain type of closure. The sum is done in the SVM
BSDF code (fast-forwarding ahead: OSL is not yet supported). After that we can
read the summed weight from the ShaderData directly.
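A minimal sketch of the "sum instead of store" idea described above. The flag and field names mirror the ones used in this patch, but every `Toy*`/`toy_*` type and function here is a simplified stand-in invented for illustration (for example, the weight is a scalar rather than a float3), not the actual kernel code.

```cpp
#include <cassert>

// Simplified stand-ins; only the flag names follow the patch.
enum SumClosureFlag {
    SD_SUM_CLOSURE_NONE,
    SD_SUM_CLOSURE_TRANSPARENT,
    SD_SUM_CLOSURE_EMISSION
};
enum ToyClosureType { TOY_TRANSPARENT, TOY_EMISSION, TOY_DIFFUSE };

struct ToyClosure {
    ToyClosureType type;
    float weight;
};

struct ToyShaderData {
    SumClosureFlag sum_closure_flag = SD_SUM_CLOSURE_NONE;
    float sum_closure_weight = 0.0f;  // scalar here for brevity
    int num_closure = 0;
    ToyClosure closure[8];
};

// Returns true when the caller must skip normal closure storage.
inline bool toy_closure_weight_sum(ToyShaderData *sd,
                                   ToyClosureType type,
                                   float weight,
                                   float mix_weight)
{
    if (sd->sum_closure_flag == SD_SUM_CLOSURE_NONE) {
        return false;  // regular evaluation: store the closure as usual
    }
    const bool wanted =
        (sd->sum_closure_flag == SD_SUM_CLOSURE_TRANSPARENT && type == TOY_TRANSPARENT) ||
        (sd->sum_closure_flag == SD_SUM_CLOSURE_EMISSION && type == TOY_EMISSION);
    if (wanted) {
        sd->sum_closure_weight += weight * mix_weight;
    }
    return true;  // sum mode: never touch the closure array
}

inline void toy_add_closure(ToyShaderData *sd,
                            ToyClosureType type,
                            float weight,
                            float mix_weight)
{
    if (toy_closure_weight_sum(sd, type, weight, mix_weight)) {
        return;
    }
    assert(sd->num_closure < 8);
    sd->closure[sd->num_closure++] = ToyClosure{type, weight * mix_weight};
}
```

In sum mode the caller reads `sum_closure_weight` after evaluation and `num_closure` stays at zero; with the flag left at `SD_SUM_CLOSURE_NONE` the same entry point stores closures as before.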

Now, in order to avoid code duplication, we use the trick of allocating stack
memory of the size of ShaderData without the closures array, and pass it as if
it were a proper ShaderData. Closure allocation adds a guard so that we don't
try to allocate closures in such a ShaderData, but other than that there's
nothing that would stop one from shooting oneself in the foot.
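A rough sketch of the kind of guard meant here, under the assumption that "sum mode" is detected through the sum flag. `ToyShaderData`, `ToyClosure`, `ToySumFlag` and `toy_closure_alloc` are hypothetical names for illustration. The sketch always uses a full-size struct, so it does not reproduce the undersized stack buffer itself, only the check that keeps allocation away from the closure array.

```cpp
#include <cassert>

enum ToySumFlag { TOY_SUM_NONE, TOY_SUM_TRANSPARENT };

struct ToyClosure {
    int type;
    float weight;
};

struct ToyShaderData {
    ToySumFlag sum_closure_flag;
    int num_closure;
    ToyClosure closure[4];  // the part a "lite" buffer would leave out
};

// Guarded allocation: refuse to touch the closure array in sum mode,
// because there may be no array behind the pointer at all.
inline ToyClosure *toy_closure_alloc(ToyShaderData *sd, int type, float weight)
{
    assert(sd != nullptr);
    if (sd->sum_closure_flag != TOY_SUM_NONE) {
        return nullptr;
    }
    if (sd->num_closure >= 4) {
        return nullptr;  // array is full
    }
    ToyClosure *sc = &sd->closure[sd->num_closure++];
    sc->type = type;
    sc->weight = weight;
    return sc;
}
```

The weak spot is the one the summary points out: the guard only helps as long as the flag still says "sum mode". Code that resets the flag on a lite buffer gets past it.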

Such a silent substitution of a "lite" version of ShaderData is what makes
me unhappy about this patch, and it is something where I wouldn't mind
having a discussion about possible ideas.

  • Maybe we should introduce ShaderDataLite instead and have some special API to deal with it?

    (It would be annoying to copy-paste all the shader_setup functions, though.)
  • Have some persistent flag in ShaderData, so we always know whether we are dealing with a proper ShaderData or a lite one?

    This would at least make checks in the official API more robust, because then we couldn't accidentally reset the sum flag and try to use the closure array.

All in all, this seems like a promising direction. On a GTX 1080 it gives
about 210MB of VRAM saving (which we can later re-use for shadow_blocked(),
so don't get too excited) and it also gives a few percent of speedup in the
koro scene by the looks of it.

Diff Detail

Repository
rB Blender
Branch
cycles_shadow_closure_v2
Build Status
Buildable 944
Build 944: arc lint + arc unit

Event Timeline

Sergey Sharybin (sergey) retitled this revision to Cycles: Experiment with removing closures storage for certain evaluation.
Sergey Sharybin (sergey) updated this object.

Eh, crap. I read the comment in D2226 too late and totally forgot about D2023, so we've kind of done similar things now. The thing is, I didn't see speed regressions in my tests, so something is still different.

How do we move forward now? :)

I like this patch, it should be faster for shadows/emission/background than what I did in D2023, avoiding closure storage entirely.

I haven't had time to review the code properly yet, though. I can run some benchmarks with this and other patches, tomorrow or over the weekend.

intern/cycles/kernel/kernel_types.h
994

Shouldn't this one be at the very end of the struct if it's the part that we're leaving out?

At least svm_closure_weight is still used even if we only sum the closure weights, so we should be able to access that without reading past the end of the struct.
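The layout concern can be sketched as follows, assuming a toy struct in which everything needed in sum mode sits before the closure array. `ToyShaderData`, `ToyClosure` and `toy_lite_size` are hypothetical names, and the field list is a small subset chosen for illustration, not the real ShaderData.

```cpp
#include <cassert>
#include <cstddef>

struct ToyClosure {
    float weight[3];
    int type;
};

struct ToyShaderData {
    int flag;
    int sum_closure_flag;
    float sum_closure_weight[3];
    float svm_closure_weight[3];  // still read and written in sum mode
    int num_closure;
    ToyClosure closure[64];       // last member: the part that is left out
};

// Size of a buffer that holds everything except the closure array.
constexpr std::size_t toy_lite_size()
{
    return offsetof(ToyShaderData, closure);
}

// Fails to compile if the field ever moves behind the closure array,
// i.e. if a lite buffer would no longer cover it.
static_assert(offsetof(ToyShaderData, svm_closure_weight) + sizeof(float[3])
                  <= offsetof(ToyShaderData, closure),
              "svm_closure_weight must fit inside the lite part");
```

With the array as the last member, every other field has an offset below `toy_lite_size()`, so accessing it never reads past the end of the smaller buffer.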

intern/cycles/kernel/kernel_types.h
994

That's a good point. Will fix that and update patch soon.

Sergey Sharybin (sergey) edited edge metadata.
  • move closures to the end of ShaderData
  • Make BPT benefit from this work as well

Fix for stack corruption with BPT

indirect_sd can't be squashed, so we'll have less memory benefit for BPT now.

Perhaps there are tricks to improve this, though :)

Have a feeling that sd can be simplified, but needs a closer look.

Testing on Linux, results are good with NVidia CUDA but not AMD OpenCL. Did not figure out yet what the specific issue is.

Scene                 GTX 960 render time   R9 380 render time
BMW                   -0.2%                 +5.2%
Fishy Cat             -0.8%                 +1.1%
Pabellon Barcelona    -4.2%                 +4.3%
Classroom             -3.2%
Koro                  +0.1%

intern/cycles/kernel/kernel_shader.h
1029

Should be using ccl_fetch().

1247

Must use ccl_addr_space PathState.

intern/cycles/kernel/svm/svm_closure.h
30–50

Should be using ccl_fetch().

Fixes for ccl_fetch() reported by Brecht.
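As far as I understand it, the point of going through ccl_fetch() is that ShaderData members are accessed through one macro, so the same code still compiles when the split kernel stores ShaderData in a different (SoA) layout. The toy below illustrates only that idea. `TOY_FETCH`, `TOY_SPLIT_KERNEL`, `toy_thread_index` and the structs are hypothetical and are not the actual Cycles definitions.

```cpp
#include <cassert>

#ifdef TOY_SPLIT_KERNEL
// SoA layout: one array per member, indexed by a per-thread ID.
static int toy_thread_index = 0;  // stand-in for the real thread index
struct ToyShaderData {
    float *soa_sum_closure_weight;
};
#  define TOY_FETCH(sd, member) ((sd)->soa_##member[toy_thread_index])
#else
// Plain layout: the member lives directly in the struct.
struct ToyShaderData {
    float sum_closure_weight;
};
#  define TOY_FETCH(sd, member) ((sd)->member)
#endif

// Written once; valid for both layouts because access goes through the macro.
inline void toy_accumulate(ToyShaderData *sd, float weight)
{
    TOY_FETCH(sd, sum_closure_weight) += weight;
}
```

Writing `sd->sum_closure_weight` directly would only compile for the plain layout, which is presumably why the review asks for the macro at the flagged lines.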

Attempts to solve OpenCL slowdown:

  • Inline the helper function.
  • ifdef volume-only code, volume closures are not supported on OpenCL anyway.
  • Remove sum check from closures which never use it.

Let's see if that gives any difference.

TODO: sd_shadow could be reduced on OpenCL, which would lower the memory use there as well.

Update against latest master

Need to re-benchmark, but afraid this gives some slowdown on CUDA and koro.blend.

Could be caused by switch to CUDA 8, not sure yet.

Actually, I just re-benchmarked and didn't see much time difference. Could be that I had some interference, perhaps.

For OpenCL I also didn't see much difference in either time or memory. I was on NVidia OpenCL, though. For the memory optimization on OpenCL we'll need to do some special trickery in the split kernel, which was a bit complicated with SoA...

But perhaps we can proceed with CUDA for now?

Update against latest master

@Brecht Van Lommel (brecht), do you think it's something cool to have in master?

There is one TODO remaining, which is OSL support. I think this one will be
simple to solve if we agree this patch is useful in general.

I think it's absolutely worth doing this to save stack memory.

This update includes the following:

  • Fixed sampling of volume extinction, which was not applying the proper weight for absorption. This made all non-OSL regression tests pass.
  • Initial support for OSL, which turned out to be quite tricky because we need to know certain closure ID types early on. Hopefully, the code is still clear enough.

Render results on NVidia GTX 1080

Results on NVidia cards are quite promising.

Scene       master time   master memory   patch time   patch memory
bmw27       01:43.92      1097            01:42.20     897
classroom   04:22.59      1274            04:24.31     1072
fishy_cat   04:34.77      1442            04:30.12     1246
koro        09:23.09      1426            09:23.38     1229
barcelona   08:54.90      1111            08:51.97     911
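To make the memory difference in the GTX 1080 table easier to read, here is a small helper that subtracts the patch column from the master column. The numbers are copied from the table, in whatever unit the table uses; the struct and function names are made up for this example.

```cpp
#include <cassert>
#include <cstdio>

struct SceneMemory {
    const char *scene;
    int master;  // "master memory" column
    int patch;   // "patch memory" column
};

// Values copied from the GTX 1080 results table.
static const SceneMemory kGtx1080Memory[] = {
    {"bmw27", 1097, 897},
    {"classroom", 1274, 1072},
    {"fishy_cat", 1442, 1246},
    {"koro", 1426, 1229},
    {"barcelona", 1111, 911},
};

inline int memory_saved(const SceneMemory &row)
{
    return row.master - row.patch;
}

inline void print_memory_savings()
{
    for (const SceneMemory &row : kGtx1080Memory) {
        std::printf("%-10s saved %d\n", row.scene, memory_saved(row));
    }
}
```

The saving is close to 200 for every scene (between 196 and 202), which fits a fixed per-kernel stack reduction rather than something that depends on the scene.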

Render results on AMD Radeon RX 480

The AMD card is more depressing. The initial patch time was actually measured
without reducing the storage size for emission_sd/sd_DL. Using the reduced
size for that shader data solved some of the speed regression (i.e. fishy_cat
went down to about 8min) but it was still slower than the baseline.

The tweaks patch is

diff --git a/intern/cycles/kernel/closure/bsdf_util.h b/intern/cycles/kernel/closure/bsdf_util.h
index 865bae2ccf3..b736e4e28a3 100644
--- a/intern/cycles/kernel/closure/bsdf_util.h
+++ b/intern/cycles/kernel/closure/bsdf_util.h
@@ -156,65 +156,84 @@ ccl_device_forceinline float3 interpolate_fresnel_color(float3 L, float3 H, floa
 return cspec0 * (1.0f - FH) + make_float3(1.0f, 1.0f, 1.0f) * FH;
 }

-ccl_device_inline bool shader_closure_weight_sum(ShaderData *sd,
- uint type,
- float3 closure_weight,
- float mix_weight)
+ccl_device_inline bool shader_closure_bsdf_weight_sum(ShaderData *sd,
+ uint type,
+ float3 closure_weight,
+ float mix_weight)
 {
 if(sd->sum_closure_flag == SD_SUM_CLOSURE_NONE) {
 return false;
 }
- switch(sd->sum_closure_flag) {
- case SD_SUM_CLOSURE_NONE:
- break;
- case SD_SUM_CLOSURE_TRANSPARENT:
- if(type == CLOSURE_BSDF_TRANSPARENT_ID) {
- const float3 weight = closure_weight * mix_weight;
- sd->sum_closure_weight += weight;
- }
- break;
+ if(sd->sum_closure_flag != SD_SUM_CLOSURE_TRANSPARENT) {
+ return true;
+ }
+ if(type == CLOSURE_BSDF_TRANSPARENT_ID) {
+ const float3 weight = closure_weight * mix_weight;
+ sd->sum_closure_weight += weight;
+ }
+ return true;
+}
+
 #ifdef __VOLUME__
- case SD_SUM_CLOSURE_VOLUME:
- if(CLOSURE_IS_VOLUME_SCATTER(type)) {
- const float3 weight = closure_weight * mix_weight;
- const float sample_weight = fabsf(average(weight));
- if(sample_weight < CLOSURE_WEIGHT_CUTOFF) {
- return true;
- }
- sd->flag |= SD_SCATTER;
- sd->sum_closure_weight += weight;
- }
- else if(CLOSURE_IS_VOLUME_ABSORPTION(type)) {
- const float3 absorbtion_weight =
- make_float3(1.0f, 1.0f, 1.0f) - closure_weight;
- const float3 weight = absorbtion_weight * mix_weight;
- const float sample_weight = fabsf(average(weight));
- if(sample_weight < CLOSURE_WEIGHT_CUTOFF) {
- return true;
- }
- sd->flag |= SD_SCATTER;
- sd->sum_closure_weight += weight;
- }
- break;
+ccl_device_inline bool shader_closure_volume_weight_sum(ShaderData *sd,
+ uint type,
+ float3 closure_weight,
+ float mix_weight)
+{
+ if(sd->sum_closure_flag == SD_SUM_CLOSURE_NONE) {
+ return false;
+ }
+ if(sd->sum_closure_flag != SD_SUM_CLOSURE_VOLUME) {
+ return true;
+ }
+ if(CLOSURE_IS_VOLUME_SCATTER(type)) {
+ const float3 weight = closure_weight * mix_weight;
+ sd->flag |= SD_SCATTER;
+ sd->sum_closure_weight += weight;
+ }
+ else if(CLOSURE_IS_VOLUME_ABSORPTION(type)) {
+ const float3 absorbtion_weight =
+ make_float3(1.0f, 1.0f, 1.0f) - closure_weight;
+ const float3 weight = absorbtion_weight * mix_weight;
+ sd->flag |= SD_ABSORPTION;
+ sd->sum_closure_weight += weight;
+ }
+ return true;
+}
 #endif /* __VOLUME__ */
- case SD_SUM_CLOSURE_EMISSION:
- if(CLOSURE_IS_EMISSION(type)) {
- const float3 weight = closure_weight * mix_weight;
- sd->sum_closure_weight += weight;
- sd->flag |= SD_EMISSION;
- }
- break;
- case SD_SUM_CLOSURE_BACKGROUND:
- if(CLOSURE_IS_BACKGROUND(type)) {
- const float3 weight = closure_weight * mix_weight;
- sd->sum_closure_weight += weight;
- }
- break;
+
+ccl_device_inline bool shader_closure_emission_weight_sum(ShaderData *sd,
+ float3 closure_weight,
+ float mix_weight)
+{
+ if(sd->sum_closure_flag == SD_SUM_CLOSURE_NONE) {
+ return false;
+ }
+ if(sd->sum_closure_flag != SD_SUM_CLOSURE_EMISSION) {
+ return true;
+ }
+ const float3 weight = closure_weight * mix_weight;
+ sd->sum_closure_weight += weight;
+ sd->flag |= SD_EMISSION;
+ return true;
+}
+
+ccl_device_inline bool shader_closure_background_weight_sum(
+ ShaderData *sd,
+ float3 closure_weight,
+ float mix_weight)
+{
+ if(sd->sum_closure_flag == SD_SUM_CLOSURE_NONE) {
+ return false;
+ }
+ if(sd->sum_closure_flag != SD_SUM_CLOSURE_BACKGROUND) {
+ return true;
 }
+ const float3 weight = closure_weight * mix_weight;
+ sd->sum_closure_weight += weight;
 return true;
 }

 CCL_NAMESPACE_END

 #endif /* __BSDF_UTIL_H__ */
-
diff --git a/intern/cycles/kernel/osl/background.cpp b/intern/cycles/kernel/osl/background.cpp
index db31f146867..9d595390f0e 100644
--- a/intern/cycles/kernel/osl/background.cpp
+++ b/intern/cycles/kernel/osl/background.cpp
@@ -38,6 +38,7 @@

 #include "kernel/kernel_compat_cpu.h"
 #include "kernel/closure/alloc.h"
+#include "kernel/closure/bsdf_util.h"

 CCL_NAMESPACE_BEGIN

@@ -53,7 +54,7 @@ class GenericBackgroundClosure : public CClosurePrimitive {
 public:
 void setup(ShaderData *sd, int /* path_flag */, float3 weight)
 {
- if(sum_weight(sd, CLOSURE_BACKGROUND_ID, weight)) {
+ if(shader_closure_background_weight_sum(sd, weight, 1.0f)) {
 return;
 }
 closure_alloc(sd, sizeof(ShaderClosure), CLOSURE_BACKGROUND_ID, weight);
@@ -71,9 +72,6 @@ class HoldoutClosure : CClosurePrimitive {
 public:
 void setup(ShaderData *sd, int /* path_flag */, float3 weight)
 {
- if(sum_weight(sd, CLOSURE_HOLDOUT_ID, weight)) {
- return;
- }
 closure_alloc(sd, sizeof(ShaderClosure), CLOSURE_HOLDOUT_ID, weight);
 sd->flag |= SD_HOLDOUT;
 }
@@ -89,9 +87,6 @@ class AmbientOcclusionClosure : public CClosurePrimitive {
 public:
 void setup(ShaderData *sd, int /* path_flag */, float3 weight)
 {
- if(sum_weight(sd, CLOSURE_AMBIENT_OCCLUSION_ID, weight)) {
- return;
- }
 closure_alloc(sd, sizeof(ShaderClosure), CLOSURE_AMBIENT_OCCLUSION_ID, weight);
 sd->flag |= SD_AO;
 }
diff --git a/intern/cycles/kernel/osl/emissive.cpp b/intern/cycles/kernel/osl/emissive.cpp
index c4f7875ac67..8edccd70b2f 100644
--- a/intern/cycles/kernel/osl/emissive.cpp
+++ b/intern/cycles/kernel/osl/emissive.cpp
@@ -39,6 +39,7 @@
 #include "kernel/kernel_compat_cpu.h"
 #include "kernel/kernel_types.h"
 #include "kernel/closure/alloc.h"
+#include "kernel/closure/bsdf_util.h"
 #include "kernel/closure/emissive.h"

 CCL_NAMESPACE_BEGIN
@@ -56,7 +57,7 @@ class GenericEmissiveClosure : public CClosurePrimitive {
 public:
 void setup(ShaderData *sd, int /* path_flag */, float3 weight)
 {
- if(sum_weight(sd, CLOSURE_EMISSION_ID, weight)) {
+ if(shader_closure_emission_weight_sum(sd, weight, 1.0f)) {
 return;
 }
 closure_alloc(sd, sizeof(ShaderClosure), CLOSURE_EMISSION_ID, weight);
diff --git a/intern/cycles/kernel/osl/osl_closures.cpp b/intern/cycles/kernel/osl/osl_closures.cpp
index c3026801b02..0f809d14ccf 100644
--- a/intern/cycles/kernel/osl/osl_closures.cpp
+++ b/intern/cycles/kernel/osl/osl_closures.cpp
@@ -70,11 +70,6 @@ using namespace OSL;

 /* Closure */

-bool CClosurePrimitive::sum_weight(ShaderData *sd, uint type, float3 weight)
-{
- return shader_closure_weight_sum(sd, type, weight, 1.0f);
-}
-
 /* BSDF class definitions */

 BSDF_CLOSURE_CLASS_BEGIN(Diffuse, diffuse, DiffuseBsdf, LABEL_DIFFUSE)
diff --git a/intern/cycles/kernel/osl/osl_closures.h b/intern/cycles/kernel/osl/osl_closures.h
index df92197d7e9..c6b962b09f9 100644
--- a/intern/cycles/kernel/osl/osl_closures.h
+++ b/intern/cycles/kernel/osl/osl_closures.h
@@ -105,7 +105,6 @@ void name(RendererServices *, int id, void *data) \
 class CClosurePrimitive {
 public:
 virtual void setup(ShaderData *sd, int path_flag, float3 weight) = 0;
- bool sum_weight(ShaderData *sd, uint type, float3 weight);

 OSL::ustring label;
 };
@@ -160,7 +159,7 @@ public: \
 void setup(ShaderData *sd, int path_flag, float3 weight) \
 { \
 if(!skip(sd, path_flag, TYPE)) { \
- if(sum_weight(sd, CLOSURE_BSDF_TRANSPARENT_ID, weight)) { \
+ if(shader_closure_bsdf_weight_sum(sd, CLOSURE_BSDF_TRANSPARENT_ID, weight, 1.0f)) { \
 return; \
 } \
 structname *bsdf = (structname*)bsdf_alloc_osl(sd, sizeof(structname), weight, &params); \
@@ -194,17 +193,16 @@ public: \
 \
 void setup(ShaderData *sd, int path_flag, float3 weight) \
 { \
- if(TYPE == LABEL_VOLUME_SCATTER) { \
- if(sum_weight(sd, CLOSURE_VOLUME_HENYEY_GREENSTEIN_ID, weight)) { \
- return; \
- } \
- } \
- else { \
- if(sum_weight(sd, CLOSURE_VOLUME_ABSORPTION_ID, weight)) { \
- return; \
- } \
+ if(shader_closure_volume_weight_sum(sd, \
+ (TYPE == LABEL_VOLUME_SCATTER) \
+ ? CLOSURE_VOLUME_HENYEY_GREENSTEIN_ID \
+ : CLOSURE_VOLUME_ABSORPTION_ID, \
+ weight, \
+ 1.0f)) \
+ { \
+ return; \
 } \
- structname *volume = (structname*)bsdf_alloc_osl(sd, sizeof(structname), weight, &params); \
+ structname *volume = (structname*)bsdf_alloc_osl(sd, sizeof(structname), weight, &params); \
 sd->flag |= (volume) ? volume_##lower##_setup(volume) : 0; \
 } \
 }; \
diff --git a/intern/cycles/kernel/svm/svm_closure.h b/intern/cycles/kernel/svm/svm_closure.h
index ff8503ad70e..0b80026ae60 100644
--- a/intern/cycles/kernel/svm/svm_closure.h
+++ b/intern/cycles/kernel/svm/svm_closure.h
@@ -18,9 +18,42 @@ CCL_NAMESPACE_BEGIN

 /* Closure Nodes */

-ccl_device_inline bool svm_node_closure_sum(ShaderData *sd, uint type, float mix_weight)
+ccl_device_inline bool svm_node_closure_bsdf_sum(ShaderData *sd,
+ uint type,
+ float mix_weight)
 {
- return shader_closure_weight_sum(sd, type, sd->svm_closure_weight, mix_weight);
+ return shader_closure_bsdf_weight_sum(sd,
+ type,
+ sd->svm_closure_weight,
+ mix_weight);
+}
+
+#ifdef __VOLUME__
+ccl_device_inline bool svm_node_closure_volume_sum(ShaderData *sd,
+ uint type,
+ float mix_weight)
+{
+ return shader_closure_volume_weight_sum(sd,
+ type,
+ sd->svm_closure_weight,
+ mix_weight);
+}
+#endif /* __VOLUME__ */
+
+ccl_device_inline bool svm_node_closure_emission_sum(ShaderData *sd,
+ float mix_weight)
+{
+ return shader_closure_emission_weight_sum(sd,
+ sd->svm_closure_weight,
+ mix_weight);
+}
+
+ccl_device_inline bool svm_node_closure_background_sum(ShaderData *sd,
+ float mix_weight)
+{
+ return shader_closure_background_weight_sum(sd,
+ sd->svm_closure_weight,
+ mix_weight);
 }

 ccl_device void svm_node_glass_setup(ShaderData *sd, MicrofacetBsdf *bsdf, int type, float eta, float roughness, bool refract)
@@ -75,7 +108,7 @@ ccl_device void svm_node_closure_bsdf(KernelGlobals *kg, ShaderData *sd, float *
 if(mix_weight == 0.0f)
 return;

- if(svm_node_closure_sum(sd, type, mix_weight)) {
+ if(svm_node_closure_bsdf_sum(sd, type, mix_weight)) {
 return;
 }

@@ -843,7 +876,7 @@ ccl_device void svm_node_closure_volume(KernelGlobals *kg, ShaderData *sd, float
 float param2 = (stack_valid(param2_offset))? stack_load_float(stack, param2_offset): __uint_as_float(node.w);
 float density = fmaxf(param1, 0.0f);

- if(svm_node_closure_sum(sd, type, mix_weight * density)) {
+ if(svm_node_closure_volume_sum(sd, type, mix_weight * density)) {
 return;
 }

@@ -883,14 +916,14 @@ ccl_device void svm_node_closure_emission(ShaderData *sd, float *stack, uint4 no
 if(mix_weight == 0.0f)
 return;

- if(svm_node_closure_sum(sd, CLOSURE_EMISSION_ID, mix_weight)) {
+ if(svm_node_closure_emission_sum(sd, mix_weight)) {
 return;
 }

 closure_alloc(sd, sizeof(ShaderClosure), CLOSURE_EMISSION_ID, sd->svm_closure_weight * mix_weight);
 }
 else {
- if(svm_node_closure_sum(sd, CLOSURE_EMISSION_ID, 1.0f)) {
+ if(svm_node_closure_emission_sum(sd, 1.0f)) {
 return;
 }
 closure_alloc(sd, sizeof(ShaderClosure), CLOSURE_EMISSION_ID, sd->svm_closure_weight);
@@ -909,14 +942,14 @@ ccl_device void svm_node_closure_background(ShaderData *sd, float *stack, uint4
 if(mix_weight == 0.0f)
 return;

- if(svm_node_closure_sum(sd, CLOSURE_BACKGROUND_ID, mix_weight)) {
+ if(svm_node_closure_background_sum(sd, mix_weight)) {
 return;
 }

 closure_alloc(sd, sizeof(ShaderClosure), CLOSURE_BACKGROUND_ID, sd->svm_closure_weight * mix_weight);
 }
 else {
- if(svm_node_closure_sum(sd, CLOSURE_BACKGROUND_ID, 1.0f)) {
+ if(svm_node_closure_background_sum(sd, 1.0f)) {
 return;
 }
 closure_alloc(sd, sizeof(ShaderClosure), CLOSURE_BACKGROUND_ID, sd->svm_closure_weight);
where I've attempted to reduce branching. Can't
tell whether it's any better.

Scene       master time   patch time   patch+tweaks time
bmw27       03:00.62      02:58.21     03:04.66
classroom   05:47.63      05:57.49     06:31.42
fishy_cat   07:08.13      08:20.05     07:57.39
koro        07:53.83      08:01.15     07:57.14
barcelona   12:29.67      12:48.11     14:23.56
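To put the RX 480 times on a common scale, the helper below converts the table's mm:ss.cc strings to seconds and reports the change relative to master in percent. The function names are made up for this example; the inputs are the strings from the table.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>

// Parse a render time written as "mm:ss.cc" into seconds.
inline double parse_render_time(const char *text)
{
    int minutes = 0;
    double seconds = 0.0;
    const int parsed = std::sscanf(text, "%d:%lf", &minutes, &seconds);
    assert(parsed == 2);
    (void)parsed;  // silence unused warning when asserts are compiled out
    return minutes * 60.0 + seconds;
}

// Positive result: slower than master. Negative: faster.
inline double slowdown_percent(const char *master_time, const char *other_time)
{
    const double base = parse_render_time(master_time);
    return (parse_render_time(other_time) - base) / base * 100.0;
}
```

For example, `slowdown_percent("07:08.13", "08:20.05")` gives roughly 17% for fishy_cat with the patch, while the bmw27 patch time comes out slightly negative, i.e. a little faster than master.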

So I'm currently not sure how to solve this regression. Maybe just commit it
as some ifdef-ed code, unless someone can find or think of a trick to solve
the speed regression on AMD cards. Maybe it's the cast itself that takes so
much time?

I can confirm the performance issue with my AMD RX 480 here. I made some big modifications to the implementation in P553, summing the emission/absorption/transparency/background for all shader evaluations. The intention is to keep the code in SVM simpler and hopefully avoid the performance regression. Volume rendering code also simplifies a bit.

It seems to avoid the regression with fishy cat, but bmw27 is a little slower too, didn't investigate yet why.

@Brecht Van Lommel (brecht), shall we then close this patch and create a review from your paste? Or maybe you commandeer this revision and replace it with your patch, since my current code is not really a go anyway.

I can also confirm the speed regression on Vega64/win7 with D2249. Where I expected speedups the most is in heavy scenes like victor.blend, but render time went from 23 minutes to 48 minutes. I still have to test Brecht's version.

A border render at low resolution of victor takes 1min20 of pure render time with master vs 3min08 with P553 with crimson drivers 17.10.2 on win7 and Vega64.

Edit: redid the test to be sure, it's faster with P553: 59s vs 1min20. But it seems render times are pretty random on this scene with the latest Blender. Will double check later.

Brecht Van Lommel (brecht) updated this revision to Diff 9506. Edited Nov 3 2017, 5:54 PM

Update with patch from P553.

Performance is still ~3% slower with bmw27 and AMD RX 480 here.

Victor goes from 23 minutes to 17min40 to render with the patch. However, rendering a second time takes 43 minutes, so something goes wrong. I guess the speedup comes from the reduced memory usage and something is not freed properly until Blender is closed.
Rendering after restarting Blender again gives 17min40 the first time and 43 minutes the second time.

So it's not random, just dependent on how many times the scene was rendered.

@mathieu menuet (bliblubli), are you sure your GPU is not being throttled due to overheating?

Some further results:

scene       slowdown on Vega64/win7
bmw         1%
classroom   <1%
koro        <1%
Barcelona   1%

I do all my tests with lowered voltage. Temperature never goes above 72°C (max set at 80°C) and frequency is stable. So the timings are also reproducible with differences under 1%.

After testing on several scenes in my library, I only got good speedups for those not fitting in memory, and slowdowns under 2% on smaller ones. So on the performance side, the benefit really outweighs the very small slowdowns, if any (many scenes were as fast).

Can someone else confirm that with victor scene, the second render is much slower with OpenCL?

The random render times also show up with buildbot builds (but not in 2.79); this patch just makes it even more obvious. Render times vary between 29 seconds and 102 seconds for a small border render of victor. So I reported the bug as T53249.

Committed now, performance seems to be within 1% on both RX 480 and Titan Xp.

Any effect on out of core rendering is likely quite random, we can use T53249 to investigate the root cause.

This revision was automatically updated to reflect the committed changes.