
Multithreading Object Sync in Cycles Render
Abandoned, Public

Authored by Jagannadhan Ravi (easythrees) on Jul 17 2020, 4:08 AM.

Details

Diff Detail

Repository
rB Blender

Event Timeline

intern/cycles/blender/blender_object.cpp
717

Okay, I'll add it to the sync object structure, then.

intern/cycles/blender/blender_particles.cpp
30–32

I didn't want to replace the original method, so I just added a new one. Actually, I should ask: in previous jobs that was the protocol; should I not do that here?

Jagannadhan Ravi (easythrees) marked an inline comment as not done.Jul 20 2020, 11:10 PM
Jagannadhan Ravi (easythrees) added inline comments.
intern/cycles/blender/blender_object.cpp
194–200

Do you think it makes sense to replace the map, vector and set classes with TBB variations? The other thought I had was to add mutexes only where these structures are accessed.

So I figured one way to get sync_geometry(...) better suited for multi-threading is to change its data members to their TBB equivalents:

vector<T *> *scene_data;

map<K, T *> b_map;

set<T *> used_set;
set<void *> b_recalc;

The set members I can replace without issue with tbb::concurrent_unordered_set, but the vector and map are a little more problematic; there are a lot of compilation errors.

TBB's datatypes will not solve all synchronization problems here. They will help in the sense that you don't need to worry about synchronization when accessing and adding data to the containers, but this is not enough to properly multi-thread sync_geometry().

One thing to remember here is that multiple objects can share the same geometry. The proper solution should ensure that all used geometry is synchronized, and only once.

The idea here could be to allow sync_geometry() to return a valid pointer to geometry data, but treat this object as a possible placeholder (as in, do not access it until it is known the geometry is fully synchronized). This allows changing sync_geometry() in a way that adds geometry to the IDMap as soon as possible and does a minimal amount of work inside a lock. Something like:

sync_geometry() {
  mutex.lock();
  if (geometry_object_is_in_map_already) {
    GeometryObject *geometry = geometry_map[geometry_id];
    mutex.unlock();
    return geometry;
  }
  GeometryObject *geometry = new GeometryObject();
  geometry_map.insert(make_pair(geometry_id, geometry));
  mutex.unlock();

  // Fill in all fields of geometry.

  ...

}
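A compilable version of that sketch might look like the following; the type and field names here are illustrative placeholders, not the actual Cycles classes, and a scoped lock replaces the manual lock()/unlock() pair so the early return stays safe.

```cpp
#include <mutex>
#include <unordered_map>

// Illustrative stand-in for Cycles geometry data (not the real class).
struct GeometryObject {
  bool synced = false;
};

class GeometryCache {
 public:
  // Returns the object for `id`, creating an empty placeholder on first
  // request. `created` tells the caller whether it now owns the (slow)
  // job of filling the object in, which happens outside the lock.
  GeometryObject *lookup_or_create(int id, bool *created) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = map_.find(id);
    if (it != map_.end()) {
      *created = false;
      return it->second;  // possibly still a placeholder being filled in
    }
    GeometryObject *geometry = new GeometryObject();
    map_.emplace(id, geometry);
    *created = true;
    return geometry;
  }

 private:
  std::mutex mutex_;
  std::unordered_map<int, GeometryObject *> map_;
};
```

The key property is that the map entry appears under the lock, so other threads immediately see that someone owns this id, while the expensive fill-in happens concurrently outside the lock.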

Added a new method sync_geometry_task(...) to serve as the multi-thread friendly version of sync_geometry(...).

Heya, so I've added a new method and it seems to work for the barbershop scene. The sync used to take 12 seconds for the objects, and it looks like it takes much less than that now; I'll do more timing tests tomorrow and get them to you. Do you think you could point me to more scenes with lots of objects that I can use for testing?

Okay, looks like I am missing something here. The Pavilion Barcelona scene doesn't render properly (image attached). Do you know what the trees and lily pad objects are in the scene? Instances?

Following up on this: it looks like if the sync_task(...) method is inside the main loop that goes through the graph (the original for loop in sync_objects(...)), the missing objects get rendered. They go missing again when it's outside that loop, which makes me wonder if something is getting stale once we leave that loop. Any ideas?

This is the Pavillon scene that I've stripped down to just the grass and trees. When sync_task(...) is *inside* the loop over b_depsgraph.object_instances, you'll see the trees:

But, when it's outside that loop (and inside its own for (parallel or serial) loop), it looks like this:

I'm not sure what's special about the trees in this scene, it looks like they're a part of a collection in the Blender scene file.

EDIT: Looks like they're instances (the trees), is there something else I need to store? I am grabbing the world matrix, what else do I need?

Just want to confirm that when the sync_task(...) call is inside the main for loop, the image is rendered correctly (instances actually get rendered). So this works:

for (b_depsgraph.object_instances.begin(b_instance_iter);
     b_instance_iter != b_depsgraph.object_instances.end() && !cancel;
     ++b_instance_iter) {

  ...
  sync_task(b_depsgraph, b_view_layer, curr_obj, motion_time, false, show_lights, &use_portal);
}

But this doesn't:

for (b_depsgraph.object_instances.begin(b_instance_iter);
     b_instance_iter != b_depsgraph.object_instances.end() && !cancel;
     ++b_instance_iter) {

  ...
}

// Loop to do the sync
for (SyncObject curr_obj : b_accessible_objects) {
  sync_task(b_depsgraph, b_view_layer, curr_obj, motion_time, false, show_lights, &use_portal);
}

Any ideas what I could be missing?

Okay, I have moved instance syncing to the serial loop that builds the sync objects. However, in some cases, like the pavilion scene, it reduces the performance. Is there some way to do the sync for instances in parallel?

I'll clean the code up and put it up here, but in the meantime, here are some render times on my machine (average of 5 runs per scene):

Here is a more detailed breakdown of improvements. It looks like Koro suffers a little; maybe there are circumstances where it's better to do the sync serially. Koro has 10 objects and no instances, maybe that's why?

I cannot apply the patch cleanly. Mind updating it against recent master, maybe even using the arcanist tool?

It looks like Koro suffers a little; maybe there are circumstances where it's better to do the sync serially. Koro has 10 objects and no instances, maybe that's why?

10 objects without instances should be perfectly threadable. It is not impossible that this is noise floor (2% is a rather low difference; try running the test a few times and averaging the results). It could also be a downside of threaded sync of simple objects. Threading will always have some overhead.

updated diff against latest master

Updated against the latest master.

intern/cycles/blender/blender_object.cpp
166

Something is wrong here. I've got

error: no matching function for call to ‘ccl::id_map<ccl::ObjectKey, ccl::Object>::add_or_update(ccl::Object**, BL::Object&, BL::Object&, ccl::ObjectKey&)’

Second update to master

Hi Sergey, I've updated the review to work with the latest master. I could use your help with something: I'm seeing a crash rendering some scenes (which never happened before), and it's not in my code. I can see it in the BVH code, but I'm not sure how it's related to my change. Do you think you could take a look and see what I could be missing?

EDIT: I think I found it. I'll have something shortly.

Fixed crash as a result of sync to latest.

I've added timer information in rB9a5663ff434 so it's easier to see the speedup.

Was using it in the following way: blender --debug-cycles 2>&1 | grep "synchronizing data". On barbershop_interior.blend the time went down from 11.7846 sec to 1.73905 sec. This is rather impressive :)

Unfortunately, the following regression tests are failing with this patch:

1023 - cycles_denoise (Failed)
1026 - cycles_hair (Failed)
1033 - cycles_mesh (Failed)
1034 - cycles_motion_blur (Failed)
1038 - cycles_shader (Failed)

Hopefully they can be fixed without performance impact.

Hi @Sergey Sharybin (sergey), I fixed one regression, but can't find the cause of the others. I'm going to rewrite this change and run the tests as I go along. If anything in the change strikes you as being a culprit, please let me know (this codebase is huge!).

Hi @Sergey Sharybin (sergey) this change should work. I noticed with the cycles_shader test that it "failed" on the tex_voronoi example, but the images in the web page look exactly the same. What do you think? Can this go in?

Hi @Sergey Sharybin (sergey) this change should work. I noticed with the cycles_shader test that it "failed" on the tex_voronoi example, but the images in the web page look exactly the same. What do you think? Can this go in?

This is the result I'm having with the voronoi test:

Not sure what could be causing the difference, but it only happens with this patch applied (current master is passing the test). To me it doesn't seem to be expected, so needs an investigation and verdict before the change can go in.

Hi @Sergey Sharybin (sergey), so I looked at this and noticed a few things:

  • The difference disappears when sync_task(...) is inside the main loop, *or* if it's a serial loop outside the main loop.
  • If you bump the sample count higher in the scene file, the differences seem to go away.

Could this just be noise that's random (since the reference image also has noise), and multi-threading only makes it different, not more or less? What do you think?

Noise is not random; it is a seeded hash.

Try running sync_task from a single thread, but in a random order. Just to understand whether it is the order of objects/meshes in Cycles which causes the noise pattern difference, or whether there is some obscure threading conflict going on.

Hi @Sergey Sharybin (sergey), with this loop (serial, randomized):

std::random_device rd;
std::mt19937 g(rd());

std::shuffle(b_accessible_objects.begin(), b_accessible_objects.end(), g);
for (SyncObject curr_obj : b_accessible_objects) {
  sync_task(b_depsgraph, b_view_layer, curr_obj, motion_time, false, show_lights, &use_portal);
}

The failure happens (see below)

So I don't think it's a threading conflict.

Hi, I hope you don't mind me sneaking in: I was myself considering trying to speed up the 'object sync', and I found your nice ongoing effort.

FWIW, I'm not used to Blender code base, so I learn with this - but I could build & apply the patch here, and reproduce the Voronoi difference.

Please let me know if I can be of any assistance.

Hi @Nicolas Lelong (rotoglup), thank you for reaching out, and I'm very flattered you felt that you could learn from my work (I'm also new at this). If you look at my post above yours, the issue seems to be the order of the noise generation (which is a seeded hash). @Sergey Sharybin (sergey)'s contention is exactly right: this isn't a threading conflict, but "just" an issue of ordering. I'm waiting for Sergey to chime in (though he's in the UTC+2 time zone, so it may be a while).

What I think is happening is that there are two emissive objects in this test, which will be included in the light distribution for importance sampling. If they are added to the distribution in a different order, the resulting noise will be different.

A way to solve this would be to change LightManager::device_update_distribution, to gather all emissive object pointers in a vector and sort them by object name.

What I think is happening is that there are two emissive objects in this test, which will be included in the light distribution for importance sampling. If they are added to the distribution in a different order, the resulting noise will be different.

A way to solve this would be to change LightManager::device_update_distribution, to gather all emissive object pointers in a vector and sort them by object name.

Hi Brecht, should it matter what the noise pattern looks like? I’m wondering why it counts as a test failure.

Hi Brecht, should it matter what the noise pattern looks like? I’m wondering why it counts as a test failure.

It matters; the noise pattern must always be the same. We need it to be able to run tests reliably, and for production rendering it's needed for predictability and for distributing renders across multiple computers as well.

intern/cycles/blender/blender_geometry.cpp
43–50

Note this part of the code is out of sync with sync_geometry in latest master, it needs to use determine_geom_type().

Since the only thing that's different in this function compared to sync_geometry is the mutex lock, that can just be added to sync_geometry and we can remove this code duplication.

54

thread_scoped_lock lock(sync_mutex); should be used instead of lock() / unlock() calls whenever possible, and I think it can be used for all mutex usage in this patch.
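As a sketch of the difference (using std::lock_guard in place of Cycles' thread_scoped_lock, and hypothetical names): the scoped lock is released on every return path automatically, which is exactly what manual lock()/unlock() pairs tend to get wrong on early returns.

```cpp
#include <mutex>

// Hypothetical example state, not from the actual patch.
static std::mutex sync_mutex;
static int synced_count = 0;

// With a scoped lock, both return statements release sync_mutex; with
// manual lock()/unlock() the early return would leak the lock.
bool try_mark_synced(bool already_synced) {
  std::lock_guard<std::mutex> lock(sync_mutex);
  if (already_synced)
    return false;
  ++synced_count;
  return true;
}
```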

125–127

This needs to be in the sync_mutex lock, it's not safe otherwise.

That would only be the case if find() and insert() together were a single atomic operation.
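A minimal sketch of why the pair of calls must share one lock (hypothetical names): the lock has to cover the gap between find() and insert(), otherwise two threads can both see "not found" and both start syncing the same geometry.

```cpp
#include <mutex>
#include <set>

static std::mutex set_mutex;
static std::set<int> synced_ids;

// Returns true exactly once per id. find() and insert() execute under the
// same lock, so there is no window where a second thread can also observe
// "not present" and claim the same id.
bool claim_for_sync(int id) {
  std::lock_guard<std::mutex> lock(set_mutex);
  if (synced_ids.find(id) != synced_ids.end())
    return false;
  synced_ids.insert(id);
  return true;
}
```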

intern/cycles/blender/blender_object.cpp
36–37

Don't include TBB directly, use util/util_tbb.h which pulls the TBB functions into the Cycles namespace so you don't have to use the tbb:: prefix.

Also, includes for external libraries like this should use <> rather than "".

101

For this patch to be ready to commit, this function and sync_dupli_particle also need to be deduplicated and simply always use SyncObject.

I'm not sure if you are keeping it this way temporarily for the patch, but for final code review this needs to be solved.

155–157

This mutex lock should move into sync_geometry_motion similar to sync_geometry.

710

Here it should check progress.get_cancel() to stop exporting objects when rendering gets cancelled.
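The requested pattern is roughly the following (a hedged sketch; `get_cancel` stands in for progress.get_cancel() and the object list is simplified to ints):

```cpp
#include <vector>

// Poll the cancel flag once per iteration so a long export loop stops
// promptly when rendering is cancelled. Returns how many objects were
// exported before cancellation.
int export_objects(const std::vector<int> &objects, bool (*get_cancel)()) {
  int exported = 0;
  for (int id : objects) {
    if (get_cancel())
      break;
    (void)id;  // real code would sync the object here
    ++exported;
  }
  return exported;
}
```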

intern/cycles/blender/blender_sync.h
288–290

To me it seems that with the current implementation, all access to these still has to be mutex protected.

So then there is also no point in using concurrent data structures. I would consider that a potential future optimization, and just leave these as regular sets in this patch.

What I think is happening is that there are two emissive objects in this test, which will be included in the light distribution for importance sampling. If they are added to the distribution in a different order, the resulting noise will be different.

A way to solve this would be to change LightManager::device_update_distribution, to gather all emissive object pointers in a vector and sort them by object name.

Hi @Brecht Van Lommel (brecht), do you mean sort the vectors of things like the lights? They're under the scene object.

Hi @Brecht Van Lommel (brecht), do you mean sort the vectors of things like the lights? They're under the scene object.

I'm referring to all lights and meshes that will be put in the distribution in that function.

This would not sort the lights or objects in the scene itself, but rather just create a new vector with pointers to those, and then sort them for creating the light distribution.

intern/cycles/blender/blender_object.cpp
652

This seems to skip multithreading for instances, maybe to reduce threading or memory usage overhead?

However, it's quite common for some geometry to only ever be instanced, without it being directly in the scene. In fact, some production scenes might consist entirely of instanced objects, and so this would disable multithreading completely.

Further, sync_object does not contain a mutex lock and so is unsafe when it runs at the same time as sync_task.

I suggest removing this exception entirely. If it's a performance problem we can find some alternative solution, but this one doesn't work I think.

Hi, I have the feeling that this feature/patch could be simplified. On my side, I've given a try to an approach similar to what @Sergey Sharybin (sergey) mentioned in D8230, which seems to work fine. The general idea is:

  • leave the sync_objects loop as is
  • push the geometry sync tasks in a TaskPool, making sure to use a BL::Object that has the right lifespan
  • wait for the task pool to finish

My code is visible on github for now: https://github.com/rotoglup/blender/pull/1/files

I may be missing something obvious, and this is perhaps not the best place to mention this? Thanks.
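The three steps above can be approximated in portable C++ with std::async; Cycles' actual TaskPool API differs, and the names below are placeholders, but the serial-push-then-wait structure is the same.

```cpp
#include <atomic>
#include <future>
#include <vector>

std::atomic<int> geometries_synced{0};

// Stand-in for the expensive per-geometry sync work.
void sync_geometry_job(int /*geometry_id*/) {
  geometries_synced.fetch_add(1);
}

// The object loop stays serial and only pushes independent geometry jobs,
// then waits for all of them, mirroring the TaskPool push/wait shape. The
// launch policy lets the runtime run jobs on worker threads or inline.
void sync_objects_with_pool(const std::vector<int> &geometry_ids) {
  std::vector<std::future<void>> tasks;
  for (int id : geometry_ids)
    tasks.push_back(std::async(std::launch::async | std::launch::deferred,
                               sync_geometry_job, id));
  for (std::future<void> &task : tasks)
    task.get();  // wait for the "pool" to finish
}
```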

Jagannadhan Ravi (easythrees) marked an inline comment as done.Oct 5 2020, 6:15 PM

Hi, I have the feeling that this feature/patch could be simplified. On my side, I've given a try to an approach similar to what @Sergey Sharybin (sergey) mentioned in D8230, which seems to work fine. The general idea is:

  • leave the sync_objects loop as is
  • push the geometry sync tasks in a TaskPool, making sure to use a BL::Object that has the right lifespan
  • wait for the task pool to finish

My code is visible on github for now: https://github.com/rotoglup/blender/pull/1/files

I may be missing something obvious, and this is perhaps not the best place to mention this? Thanks.

Have you tested this on the benchmark scenes and observed performance improvements?

Have you tested this on the benchmark scenes and observed performance improvements?

I just made a quick test, but I find that the "synchronizing" step in the benchmark scenes is already very fast. I don't know how I could do precise measurements of improvements; how do you do it?

On my own test scene :

  • I initially have a sync duration of about 9 seconds (with *master* code),
  • I get down to ~3 seconds with my patch

So there's improvement, but I'm not sure how I could measure the "synchronizing" part better, as I'm eyeballing the status bar during rendering.

On the same grounds, it seems that this patch is a little "slower" (by 1 second), maybe because I have a scene with many instances (for which multithreading is skipped at the moment).

Hi, I have the feeling that this feature/patch could be simplified. On my side, I've given a try to an approach similar to what @Sergey Sharybin (sergey) mentioned in D8230, which seems to work fine. The general idea is:

  • leave the sync_objects loop as is
  • push the geometry sync tasks in a TaskPool, making sure to use a BL::Object that has the right lifespan
  • wait for the task pool to finish

My code is visible on github for now: https://github.com/rotoglup/blender/pull/1/files

I may be missing something obvious, and this is perhaps not the best place to mention this? Thanks.

Have you tested this on the benchmark scenes and observed performance improvements?

I just stick in timing code.

I just stick in timing code.

OK, I did this, too. I attach my results on 2 scenes (benchmark's classroom + my own).

Jagannadhan Ravi (easythrees) marked an inline comment as done.Oct 5 2020, 11:16 PM

I just stick in timing code.

OK, I did this, too. I attach my results on 2 scenes (benchmark's classroom + my own).

Do you think you could put your change and timing results in another review? That will make things easier to track. I can also go into detail there about what I do for timing methodologies and results.

Do you think you could put your change and timing results in another review?

I sure could. Would that be OK with you @Sergey Sharybin (sergey) and @Brecht Van Lommel (brecht) to have another review about the same feature? What would be easiest for you?

Hi @Brecht Van Lommel (brecht), do you mean sort the vectors of things like the lights? They're under the scene object.

I'm referring to all lights and meshes that will be put in the distribution in that function.

This would not sort the lights or objects in the scene itself, but rather just create a new vector with pointers to those, and then sort them for creating the light distribution.

Hi @Brecht Van Lommel (brecht) I think I need some more clarification on this. I did this:

  vector<Object *> emissive_objects;
  foreach (Object *object, scene->objects) {
    if (progress.get_cancel())
      return;

    if (!object_usable_as_light(object)) {
      continue;
    }

    emissive_objects.push_back(object);
  }

  std::sort(emissive_objects.begin(),
            emissive_objects.end(),
            [](Object *a, Object *b) -> bool { return a->name < b->name; });

  foreach (Object *object, emissive_objects) {
    if (progress.get_cancel())
      return;

    ...
  }

And I did something similar for the lights loop as well. Sorting alphabetically doesn't seem to help, am I doing this right?

And I did something similar for the lights loop as well. Sorting alphabetically doesn't seem to help, am I doing this right?

The emissive objects are not alphabetically ordered in the test file, so it is bound to still be different. Incidentally, they are added in reverse alphabetical order:

Sphere.003
Sphere.002
Sphere.001
Sphere

If you add an Emission shader on the background plane, the objects will be added in this order:

Sphere.003
Sphere.002
Sphere.001
Plane
Sphere

This change should be done in master first I think.

And I did something similar for the lights loop as well. Sorting alphabetically doesn't seem to help, am I doing this right?

The emissive objects are not alphabetically ordered in the test file, so it is bound to still be different. Incidentally, they are added in reverse alphabetical order:

Sphere.003
Sphere.002
Sphere.001
Sphere

If you add an Emission shader on the background plane, the objects will be added in this order:

Sphere.003
Sphere.002
Sphere.001
Plane
Sphere

This change should be done in master first I think.

So it sounds like sorting them by name won't help, correct? Also, can you explain your second comment? What should be done in master?

So it sounds like sorting them by name won't help, correct?

Yes, it won't help making the test pass.

Also, can you explain your second comment? What should be done in master?

I was referring to sorting the objects by name.

So it sounds like sorting them by name won't help, correct?

Yes, it won't help making the test pass.

Sorting will solve the problem. What you need to do is update the test reference image and then verify that, with multithreading, the test always passes rather than randomly failing depending on the object order.

I sure could. Would that be OK with you @Sergey Sharybin (sergey) and @Brecht Van Lommel (brecht) to have another review about the same feature? What would be easiest for you?

Can either of you run the same performance tests with both patches and post the numbers? Then we can decide if the simpler approach is enough or not, and which patch we should be reviewing and helping complete.

In general, the more we can multithread the better, so if the threading overhead of doing it at the object level is not too bad, to me that is still promising. Also, we can continue optimizing that in the future; especially if we can get the Blender side of instancing multithreaded as well, it will matter.

Can either of you run the same performance tests with both patches and post the numbers? Then we can decide if the simpler approach is enough or not, and which patch we should be reviewing and helping complete.

I can do that if it's ok with you @Jagannadhan Ravi (easythrees). I think I'll proceed as follows:

  • instrument the code to temporarily add printing of timing values:
    • time spent in the serial objects loop
    • time spent waiting for the end of the parallel execution
    • total time spent in sync_objects
  • instrument both patches as "after" results, and my current master branch as a "before" reference
  • gather data from the scenes in https://svn.blender.org/svnroot/bf-blender/trunk/lib/benchmarks/cycles/
    • I'm not able to share the "big" scene that I used previously, but I saw that the barbershop_interior scene is quite heavy in the synchronizing step

So it sounds like sorting them by name won't help, correct?

Yes, it won't help making the test pass.

Sorting will solve the problem. What you need to do is update the test reference image and then verify that, with multithreading, the test always passes rather than randomly failing depending on the object order.

Okay, I guess I need to update the reference image, then.

Can either of you run the same performance tests with both patches and post the numbers? Then we can decide if the simpler approach is enough or not, and which patch we should be reviewing and helping complete.

I can do that if it's ok with you @Jagannadhan Ravi (easythrees). I think I'll proceed as follows:

  • instrument the code to temporarily add printing of timing values:
    • time spent in the serial objects loop
    • time spent waiting for the end of the parallel execution
    • total time spent in sync_objects
  • instrument both patches as "after" results, and my current master branch as a "before" reference
  • gather data from the scenes in https://svn.blender.org/svnroot/bf-blender/trunk/lib/benchmarks/cycles/
    • I'm not able to share the "big" scene that I used previously, but I saw that the barbershop_interior scene is quite heavy in the synchronizing step

What I'd recommend is wrapping std::chrono calls around SyncObjects(...) and measuring the serial and parallel performance. You can also use @Sergey Sharybin (sergey)'s patch (it's mentioned somewhere in the history) and test on the scenes provided in the Blender Benchmark. I would also highly recommend averaging five or so runs, with a time gap in between (I use about five minutes) so that each run is a "cold" test. This review should also have an image with all the timings I had with my change. You may also want to look at the output images to make sure they match the reference images, and also run the regression tests.
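For the timing harness itself, a minimal std::chrono wrapper along these lines works (sync_objects_stub is a placeholder for the real BlenderSync call):

```cpp
#include <chrono>

// Placeholder for the real object sync call being measured.
void sync_objects_stub() {}

// Measures one call in wall-clock seconds; run it against the serial and
// parallel builds and average several "cold" runs.
double time_sync_seconds() {
  auto start = std::chrono::steady_clock::now();
  sync_objects_stub();
  auto end = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(end - start).count();
}
```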

So it sounds like sorting them by name won't help, correct?

Yes, it won't help making the test pass.

Sorting will solve the problem. What you need to do is update the test reference image and then verify that, with multithreading, the test always passes rather than randomly failing depending on the object order.

Okay, I guess I need to update the reference image, then.

Okay, from talking to "LazyDodo" in the "blender-coders" group, the reference images need to be updated when this diff gets merged. I'll look into the other changes now.

I was a little more nuanced than that. I said that as long as you can justify the changes, and the images are consistent between renders, you'll be OK; updating the reference images is not that big of a deal, but those two conditions do have to be met.

I was a little more nuanced than that. I said that as long as you can justify the changes, and the images are consistent between renders, you'll be OK; updating the reference images is not that big of a deal, but those two conditions do have to be met.

Fake news!! (Sorry, this past week has been the longest year.) Yeah, the suggestion to update the test image was made in the history above; sorry about the lack of context there.

What I'd recommend is wrapping std::chrono calls around SyncObjects(...) and measuring the serial and parallel performance, [...]

I've run the benchmark scenes with the 3 builds ('master', this patch ('parallel_for'), and my attempt at simplifying ('task_pool')):

  • IMHO, the overall performance is comparable between the 2 patches
  • For most of the scenes (except victor), the object iteration loop, without the geometry processing, is very fast (with a penalty on bmw27 for 'parallel_for', perhaps due to the mutex protecting instances)
  • barbershop_interior shows the most timing improvement from threading (80% speedup, from 30 sec to ~6 sec)

I've also checked the renderings I get from my 'task_pool' attempt, which match the renderings from 'master', and all regression tests pass.

I attach here various elements:

  • my task_pool_attempt patch
  • a spreadsheet containing the measurements I made
  • a dxdiag result for my machine (which ran in a 'silent' power plan)
  • the batch file I used to render the images

(As an aside, I built all the variants using MSVC 2019 version 19.27.29111)




In general, the more we can multithread the better, so if the threading overhead of doing it at the object level is not too bad, to me that is still promising.

FWIW, I agree, but I'm concerned about fine-grained synchronization, or over-use of 'concurrent' containers, which IME can cause difficult-to-track issues. Perhaps there are refactorings to do before we can have a simple solution to multithread at the object level?

Given these timing results, I think we should go with the simpler patch then. Performance seems to be a little better, and it avoids increasing memory overhead when there are millions of instances. I suggest submitting a new code review for that.

Unfortunately it means some of the work in this patch will not be used then, but in the end it's about what the best solution is.

In the task pool patch, I think there is still some non-thread safe code in BlenderSync::sync_geometry. It's accessing geom->transform_applied and geom->used_shaders in the main thread while a task pool thread may also be modifying them. Moving the geometry_synced test and insert earlier in the function or duplicating it (haven't checked which is correct) would avoid the issue.

b_permanent_ob is also the same as b_ob_instance as far as I can tell (maybe it needs to be renamed for clarity).

Given these timing results, I think we should go with the simpler patch then. Performance seems to be a little better, and it avoids increasing memory overhead when there are millions of instances. I suggest submitting a new code review for that.

Fine with me, @Jagannadhan Ravi (easythrees), would you be willing to submit that new review?

Jagannadhan Ravi (easythrees) marked an inline comment as done.Oct 9 2020, 6:52 PM

Given these timing results, I think we should go with the simpler patch then. Performance seems to be a little better, and it avoids increasing memory overhead when there are millions of instances. I suggest submitting a new code review for that.

Fine with me, @Jagannadhan Ravi (easythrees), would you be willing to submit that new review?

I'd be happy to, and thank you for your work on this. Do you think you could point me to it again please? I'd like to run it in our lab here to get some wider benchmarks.

I'd be happy to, and thank you for your work on this. Do you think you could point me to it again please? I'd like to run it in our lab here to get some wider benchmarks.

Great! You can either grab the patch file attached to D8324#226209, or get it from https://github.com/rotoglup/blender/pull/1/files

I'd be happy to, and thank you for your work on this. Do you think you could point me to it again please? I'd like to run it in our lab here to get some wider benchmarks.

Great! You can either grab the patch file attached to D8324#226209, or get it from https://github.com/rotoglup/blender/pull/1/files

Just an update, I've integrated the changes and will need to run tests on my end to get some numbers on our lab. I'll update in a new review when I have something.

Given these timing results, I think we should go with the simpler patch then. Performance seems to be a little better, and it avoids increasing memory overhead when there are millions of instances. I suggest submitting a new code review for that.

Unfortunately it means some of the work in this patch will not be used then, but in the end it's about what the best solution is.

In the task pool patch, I think there is still some non-thread safe code in BlenderSync::sync_geometry. It's accessing geom->transform_applied and geom->used_shaders in the main thread while a task pool thread may also be modifying them. Moving the geometry_synced test and insert earlier in the function or duplicating it (haven't checked which is correct) would avoid the issue.

b_permanent_ob is also the same as b_ob_instance as far as I can tell (maybe it needs to be renamed for clarity).

I am a little unsure about how this should be handled, but I'll make a new review and we can discuss it there.