
LineArt: Speedup construction of quad trees
ClosedPublic

Authored by YimingWu (NicksBest) on May 16 2022, 7:53 AM.

Details

Summary

Use multithreading in add_triangles to speed up quad tree building. Each thread locks its respective tile while working on triangle insertion, intersection calculation, tile splitting, and triangle array extension.

Originally this patch used atomic "compare and swap" (CAS) to get around the lock and insert more triangles into the quad tree in a shorter time. However, after extensive trials and analysis, we found that a more traditional locking approach is better in terms of thread scheduling than trying to avoid locks with CAS, and that CAS with a "busy wait" isn't really a lock-free algorithm. So we added the lock back, and based on the tests Sergey ran, it has proven to be more robust and reasonably scalable. The overall logic and code path are the same as in the CAS method described below; we just decided not to use CAS to synchronize thread data.

Further optimization would require overhauling the 2D acceleration structure, possibly with something like a fixed grid. This is left for future experimentation.


Old information on cas algorithm for context

Using the atomic "compare and swap" method in the add_triangles stage dramatically speeds up the add_triangles call and significantly reduces the overall calculation time. This is the fastest method we have experimented with so far.

Note: Needs D15020 to compile!

This performance graph was generated on a machine with the following setup:

  • CPU: Intel 4910MQ (8 threads)
  • RAM: 32GB DDR3L
  • Xubuntu 20.04 LTS

How it works

Line art needs to build a 2D quad tree and link triangles into it so that occlusion and intersections can be calculated efficiently without excessive iteration, similar to what a BVH does in 3D. LineartBoundingArea is the 2D quad tree node.

Originally line art used a single thread to build this quad tree. The temp-lineart-contained branch actually used a per-node lock to achieve multithreaded insertion, but locking the node heavily can be inefficient when multiple threads insert into the same tile. After researching different methods, @Sebastian Parborg (zeddb) and I found that the fastest approach so far (on our machines) is to use atomic compare and swap to handle insertion. This way we spend far less time in locks: a tile is only locked when it needs to be split or when its triangle array is full. The "per-node lock" algorithm was never integrated into master.

To briefly describe how the quad tree is built:

  1. There is an initial grid of roughly 10x10 tiles; each tile has a limit of 100 triangles. Beyond that it needs to split into 4 child tiles, or (if at max depth) extend the array.
  2. Threads each have a batch of triangles to be inserted into the tile (lineart_bounding_area_link_triangle_cas).
  3. When inserting into a tile, use cas (compare and swap) to get a unique index and assign the triangle into the array.
  4. If the tile is full, lock the tile and split (or extend the array), when done, release the lock.
  5. After inserting a triangle, the thread also tests this triangle against previously inserted triangles for intersections.
  6. Repeat until all triangles are inserted.
  7. Generate intersection edges from thread intersection results.
  8. Relink each tile and its adjacent tiles for the convenience of the later occlusion stage's edge marching.

Here is a logic flow showing how lineart_bounding_area_link_triangle_cas works under the hood, to give a general idea of what the insertion function is doing.

Other stuff

Additional changes include moving some auxiliary calls (like relinking adjacent tiles) out of the triangle insertion call, to maximize triangle insertion performance.

Notes

Notes on the cas and atomic_load/store calls, and why we only use that in specific places:

  • The reason for using cas instead of atomic_add to get an index (for triangle insertion) is that with multiple threads incrementing the index, a plain add could overflow the array length limit. We need "compare and swap" done atomically so that each thread re-checks the bound against the latest value before committing the increment.
  • The only two places atomic_load_z is used are where a thread needs to read addresses that are (likely) written by other threads: the tile splitting function (understandably, since the tile contains triangles that different threads inserted) and the tri-tri intersection function (which compares against previously added triangles up to its own index, which may include triangles inserted by other threads).
  • The only place atomic_store_z is used is when a thread writes a triangle address into the linked_triangles array; the write needs to be visible to all other threads.

Notes on the "busy wait":

  • We need to repeatedly atomic_load_z the linked triangle pointer while it's NULL, because in that state other threads are just finishing up assigning the last few triangles into the array. They finish very quickly (in the tile-splitting case we wait for at most the last thread_count indices), so it should not be a big problem, and I think this is a reasonable way to satisfy this quirk of the algorithm.

Diff Detail

Repository
rB Blender

Event Timeline


Clarify stuff with comments

This revision is now accepted and ready to land.May 23 2022, 6:24 PM
This revision was automatically updated to reflect the committed changes.
Ray Molenkamp (LazyDodo) reopened this revision.EditedMay 23 2022, 9:14 PM

I had to revert this as it caused build errors on macOS x64, macOS ARM, and newer versions of MSVC, which do ship <stdatomic.h> but do not support it when building in C mode.

Also I'm not entirely thrilled with some atomics code being squirreled away near the bottom of a file called source/blender/gpencil_modifiers/intern/lineart/lineart_intern.h. Please consult with @Sergey Sharybin (sergey) to devise a more general-purpose solution.

We have intern/atomic/atomic_ops.h for atomics in C, please use that since it has been tested to work well on all platforms.

Using atomic_add_and_fetch_z when all you need is a memory fence is of course going to be slower. If you really need C atomic load and store, the correct solution would be to extend intern/atomic instead of writing something specific for line art.

However, I would recommend just using C++ instead of figuring out how to do this in a cross-platform way in C. Even if we get this right for our platforms, there's a chance that someone building packages for a Linux distribution will see it break on some exotic processor architecture.

Put MSVC memory barrier code in front to bypass the incomplete C support of <stdatomic.h> on MSVC. Decided to keep the atomic code inside line art because after the shadow functionality integration I'll probably move it to C++ anyway, which spares the change to intern/atomic.

Fixed _Atomic qualifier so it should compile on Apple just fine. Missed this change when I committed it initially.

Fix comments a bit

Included the lambda change in lineart_sort_adjacent_items to prevent warnings on windows.

Sergey Sharybin (sergey) requested changes to this revision.May 24 2022, 9:45 AM

From the current patch description I'd expect the change to be as simple as replacing locks with CAS. In reality the patch does much, much more, without proper presentation and explanation of the changes, and without describing the design of the chosen threading model. In its current state it is not possible to properly review the patch, or to maintain it after it is committed. We have Ingredients of a Patch for a reason.

The benchmark does not include information about the platform it was run on. With this sort of change it is not uncommon to see a speedup on lower-end CPUs and a slowdown on CPUs with 32+ threads. So, what platform was used for the benchmark, and did you verify that a system like a Threadripper still benefits from the change?

source/blender/gpencil_modifiers/intern/lineart/lineart_cpp_bridge.cc
12 ↗(On Diff #51773)

To me this seems to be a cleanup of some sort which is not related on speedup.

15 ↗(On Diff #51773)

I believe capture here is wrong.

source/blender/gpencil_modifiers/intern/lineart/lineart_cpu.c
3215–3229

Seems unused?

3237–3239

Under which circumstances the thread is a null pointer?

3788–3792

Such busy wait is at best a very dangerous way of synchronizing threads. Need of it might also be indicative of a bad threading model chosen for the algorithm.

3927–3929

Such pattern could also be an indicative of a mistake in the threading model.

  • What if the splitting thread did not acquire the lock prior to the BLI_spin_lock here?
  • What if the splitting thread started after the BLI_spin_unlock here?
3939–3944

Comments are full sentences: start with a capital, end with a full-stop.

4253

Such pointer magic is not something I'd expect to see in modern code without a clear explanation of why it is needed.

Why the over-allocation of LineartTriangle is used instead of a more strongly typed arrays?

4381

For printf the portable way is to use %f.

source/blender/gpencil_modifiers/intern/lineart/lineart_intern.h
98

Atomic utilities should be added to atomic_ops.h

This revision now requires changes to proceed.May 24 2022, 9:45 AM
YimingWu (NicksBest) marked 4 inline comments as done.May 24 2022, 1:12 PM
YimingWu (NicksBest) added inline comments.
source/blender/gpencil_modifiers/intern/lineart/lineart_cpu.c
3215–3229

I'll remove that. Deprecated code.

3237–3239

When splitting tiles and relinking existing triangles into the new tiles, this prevents triangles from being intersection-tested against triangles that have already been tested.

3927–3929

old_ba->child is set by the lineart_bounding_area_split call, so if a thread reaches here, another thread is already doing (or has done) the splitting. Threads lock before the lineart_bounding_area_split call, so it's impossible for one thread to be splitting while the lock is not set. It's also not possible for a splitting thread to start after this unlock, because if old_ba->child != NULL then splitting is either in progress or already done.

Having old_ba->child != NULL doesn't mean splitting is complete, so here it waits until the splitting thread unlocks it which means this thread can continue inserting.

4253

It was designed this way so occlusion stage can have multiple threads storing "done" edges onto one triangle, which get rid of the lock.

The size here is sizeof(LineartTriangle) + sizeof(LineartEdge) * num_threads. See LineartTriangleThread for details.

YimingWu (NicksBest) edited the summary of this revision. (Show Details)May 24 2022, 1:43 PM

Question which is applicable in several places of the patch. There seems to be a mixed approach to reading of fields which are modified in an atomic manner: some of the places will read the value directly, others will use lineart_atomic_load. Why is it so?

source/blender/gpencil_modifiers/intern/lineart/lineart_cpu.c
3153

Remove the semi-colon. In the current code it leads to a redundant semicolon, but it could also potentially cause a real syntax error.

3237–3239

From my understanding in that case the do_intersection in some of the callers will be false.

Having such a safety return could be OK, but it at least needs an assert or something similar to indicate that it is not an intended situation to happen.

3927–3929

Then the question is: why old_ba->child is assigned to something which is only partially initialized?

3950–3955

This needs an explanation why a simple atomic increment is not usable here.

YimingWu (NicksBest) marked 4 inline comments as done.May 24 2022, 5:21 PM

some of the places will read the value directly, others will use lineart_atomic_load. Why is it so?

The "direct read" addresses are all written by the same thread, which is ensured by the CAS'd index; no other thread can CAS that exact index.

There are only two places where a thread needs to read addresses that are (likely) written by other threads: the tile splitting function (understandably, since the tile contains triangles that different threads inserted) and the tri-tri intersection function (which compares against previously added triangles up to its own index, which may include triangles inserted by other threads).

source/blender/gpencil_modifiers/intern/lineart/lineart_cpu.c
3237–3239

I'll optimize that.

3927–3929

I think I can assign that as a final step, which should remove the need for the lock-unlock combo. But I'll check first.

3950–3955

That could lead to multiple threads incrementing at the same time and overflowing the limit. We need the compare and swap done atomically. I'll put that in a comment in the next update.

YimingWu (NicksBest) marked 8 inline comments as done.May 25 2022, 8:07 AM
YimingWu (NicksBest) updated this revision to Diff 51817.EditedMay 25 2022, 8:12 AM

Updated to fix stuff pointed out in comments.

Now we don't need that "lock-unlock" pattern (which I found makes things run a bit faster).

YimingWu (NicksBest) edited the summary of this revision. (Show Details)May 25 2022, 8:43 AM

Updated to latest master with atomic call type casting corrected.

The title should be something more along the lines of: LineArt: Speedup construction of quad trees
The CAS is more of an implementation detail, and it is a very small fraction of the overall change.

I am still not sure about the busy wait. What guarantees that one thread doesn't reallocate linked_triangles while another thread is waiting for all elements to be non-null?

source/blender/gpencil_modifiers/intern/lineart/MOD_lineart.h
442

The CAS instruction is also defined for 16bit integers. So not sure the comment clarifies anything.

Can as well be a fixed-size int32 matching, say, Mesh::totvert.

source/blender/gpencil_modifiers/intern/lineart/lineart_cpu.c
3204

it's -> its

3257

sizeof(size_t) might be different from sizeof(void*).
An atomic_load_ptr is required (see how atomic_cas_ptr is implemented as an example).

3796

Can not use atomic_load_z for pointers.

3979

atomic_store_z used for pointer.

YimingWu (NicksBest) marked 2 inline comments as done.May 31 2022, 8:58 AM

What guarantees that one thread don't reallocate linked_triangles while another thread is waiting for all elements to be non-null?

It is because a thread only goes into the splitting/reallocating branch if it sees that the array is exactly full; when the array is not full, it does not wait for any synchronization, it just inserts and is done. So the only threads waiting for "non-null" are the ones that found the tile full, _all_ of which go into the "full" branch. Whoever first acquires old_ba->lock does the work, and everyone else (who also found a full index) waits at the lock until that one thread unlocks. Meanwhile, threads still doing insertion are not interfered with, because they never enter the "if full" branch.

And once the working thread unlocks, all the other waiting threads see either old_ba->child != NULL or max_triangles < old_ba->max_triangle_count, so none of them does the actual splitting/reallocating; they simply insert the triangle again into the current tile (if a child exists, the function inserts directly into the child recursively and returns).

source/blender/gpencil_modifiers/intern/lineart/MOD_lineart.h
442

Hmm, there's no atomic_cas_uint16 in there. I think I'll just remove the comment; it doesn't matter.

YimingWu (NicksBest) marked an inline comment as done.

Fixed commented stuff

YimingWu (NicksBest) retitled this revision from LineArt: Using CAS for add_triangles to LineArt: Speedup construction of quad trees.May 31 2022, 9:03 AM

It is because a thread is only going into the splitting/reallocating branch if it sees the array is exactly full, and when the array is not full it will not wait for any synchronization, it will only insert and done.

Checking whether the array is full and writing to the array is not a single atomic operation.

So imagine the following scenario of lineart_bounding_area_link_triangle_cas called from 2 threads:

  • Thread1 goes into the true branch of the if (old_tri_index < max_triangles) {, successfully modifies old_ba->triangle_count, and triangle_count is now at the maximum.
  • Thread1 continues until after getting the address &old_ba->linked_triangles[old_tri_index], then goes to sleep.
  • Thread2 goes into the false branch of the if (old_tri_index < max_triangles) {, acquires the lock, calls lineart_bounding_area_triangle_reallocate, and goes to sleep right before releasing the spin lock.
  • Thread1 wakes up and calls atomic_store_ptr with the pointer it previously calculated, but the address now points somewhere inside a freed block.
source/blender/gpencil_modifiers/intern/lineart/lineart_cpu.c
354–356

Seems that we can use MEM_recallocN?

YimingWu (NicksBest) marked an inline comment as done.May 31 2022, 1:33 PM

Thread2 goes into the false branch of the if (old_tri_index < max_triangles) {, acquires the lock, calls lineart_bounding_area_triangle_reallocate and goes to sleep right before releasing the spin.

Oh, I get it. lineart_bounding_area_triangle_reallocate *does* need a wait as well, just like the ones inside splitting (not sure why I missed it 🤔; that was the intention).

Added the wait inside the reallocating function, and used MEM_recallocN for convenience.

Tested to be working as it should.

I think the probable reason it didn't cause trouble in previous tests is that none of my test scenes are dense enough to build up to the max depth of the quad tree.

With all those waits in the code I really do not see how it is different from holding a spin lock a bit "longer". Doing so allows simplifying the code, following more traditional resource protection in a threaded environment, and letting the OS make proper scheduling decisions (since spin lock implementations typically give proper hints to the OS). Here is a patch to demonstrate what I mean: P2991

We've run it here on a MacPro with a 16-core CPU and on a Threadripper machine with 24 cores. The numbers are:

          master     D14953     D14953 + P2991
TR        0.845856   0.143175   0.135748
MacPro    0.921128   0.113998   0.110107

This is quite an interesting result. If I'm understanding it correctly, this basically means "when one triangle is being inserted into a tile, lock the entire tile until insertion/tri-tri intersection/splitting/extending is all done". This method is basically the old line art branch threading method, just written differently. The conclusion @Sebastian Parborg (zeddb) and I came to was that we need to "not lock the tile that long" and "use cas to quickly get indices without locking", which proved observably faster than this locking method. I'm going to test this and see how it performs on my machine now. Very weird that you got it faster.

YimingWu (NicksBest) added a comment.EditedJun 1 2022, 2:29 PM

Could you try this file? It hangs forever. Looks like you need to unlock in some places...

Also, the 727 file indeed doesn't differ that much between CAS and locking, but the mr_elephant file differs a lot.

The conclusion me and @Sebastian Parborg (zeddb) came up with is that we need to "not lock the tile that long", and "use cas to quickly get indexes without locking"

A spin lock is basically an integer variable, manipulated in an atomic manner. You can get an idea of how it works from the fallback Windows implementation of BLI_spin_lock/BLI_spin_unlock in threads.cc.

CAS implies a memory barrier. Using CAS in a loop to achieve a "clamping" increment is not much different from using a spin lock around a naive if-statement. A busy wait for array elements to be written is not semantically different from a spin lock either.

which proves to be observably faster than this locking method.

I'm not sure what "this locking method" you're referring to. On an intuitive/idea level the modification in P2991 is not much different from the original patch, and there is an empirical measurement from 2 different systems confirming the timing is the same with a much cleaner implementation.

Could you try this file? it hangs forever. Looks like you need to unlock at some places...

An educated guess is that the deadlock is caused by an attempt to recursively acquire the lock. I don't fully understand all the code paths and when recursion into the same node is expected (and aliases like LineartBoundingArea *old_ba = root_ba; do not help local readability either), but it should be easy to restructure lineart_bounding_area_link_triangle_cas so that it does not lead to a recursive lock.


All that being said, there seems to be a common misconception that atomics are cheap. They are cheaper than mutex, but they definitely are not coming for free. I can only suggest reading up about threads, threading models, and threads synchronization methods and primitives. Following common practices in this field will make it easier/faster to get speedup patches to be reviewed and accepted.

OK thanks for the explanation. I think I'll update the patch tomorrow for the spin-lock method which is more traditional and stable.

Just to clear up some things.
We tried to implement a CAS quad tree because we looked at a recent paper "Quadboost: A Scalable Concurrent Quadtree" that uses CAS to construct a quad tree.
According to the paper, this is the fastest way to construct a quad tree (by using lock-free methods to construct it).
This is why we chose to try atomic operations, as we tried to copy that logic.

However, we noticed that we couldn't fully copy the logic in the paper, as it:

  1. Uses points instead of triangles.
  2. Only has one point/item per quad tile.

We have profiled the current algorithm, and we know that we currently spend a very large chunk of the time simply waiting for other threads when inserting new triangles.
After a lot of thinking and discussion Yiming and I didn't manage to figure out how to work around this with quad trees so we started asking other developers for insights.

We talked to Jacques and Hans and came to the conclusion that to make it scale even better, we would have to move away from quad trees.
One idea is to use fixed grids and hope that the simpler logic and easier threadability would outperform our current implementation.

However, before we start on this future rewrite, I asked you, Sergey, whether we should spend more time testing or rewriting this into something better, or just merge what we have into master.
We came to the conclusion that we probably shouldn't spend too much time polishing the current result, as we already know it is a dead end.
However, because it is an improvement over the current master implementation, we agreed that we should probably try to merge it into master in the meantime.

I'm pointing this out as it seems like you might have gotten the wrong idea.
We know that the current method is not beautifully threaded and does not use CAS and other atomics in an optimal way.
At this point it would probably be better, as you pointed out, to simply just yank out the atomics and simply use spin locks.
It probably didn't occur to us because performance numbers from earlier stages, when we hadn't yet added many spinlocks, showed that CAS'ing the triangle array as we do now was faster on our test computers.
(We still need to verify this on our computers, as Yiming pointed out.)

Note that I'm not trying to argue that we shouldn't integrate any more of the feedback into the current code before merging into master.
I'm just trying to make clear that we are fully aware that the current method is lacking. But as it does make the code up to 8x faster than master, we think this would be a nice improvement despite its faults.

The measured speedup is certainly great, and a very welcome addition for artists. But please do not do this in the fragile and hard-to-understand way presented in the current state of the patch. As I've explained above, CAS plus a busy wait is semantically the same as a spin lock. There is also a patch showing that simpler (code- and understandability-wise) thread synchronization yields the same performance. The remaining part of the change is avoiding the recursive lock, which should not be difficult to implement.

Having simpler code makes it so others can more easily understand what's going on, maintain the code after the initial author is no longer active in the community, and allows others to more easily look into possible further improvements and make modifications without fear of breaking something.

Right. I don't disagree on any of that.

I just wanted to give context on why and how the code came to be. Because it seemed to me like you didn't understand on why we made the choices we made and ended up here.

What you pointed out indeed seems like a much better approach.

YimingWu (NicksBest) updated this revision to Diff 52139.EditedJun 2 2022, 1:29 AM

Fixed the recursive lock thing based on Sergey's modification, now it won't hang on mr_elephant file.


I totally agree about the code simplicity, but I'm still going to push back a little bit on this:

There is also a patch showing that simpler (code- and understandability-wise) thread synchronization yields the same performance. The remaining part of the change is avoiding the recursive lock, which should not be difficult to implement.

We found that real use cases are actually quite complicated. The 727 file doesn't really show that much difference because its geometry is mostly spread apart, and it was basically "made for line art". The locking method results in poor performance in files like mr_elephant.blend, where multiple threads insert into roughly the same position (an example is the pot object on the shelf in that specific file). This CAS-based optimization also splits objects into chunks for multiple threads to insert at the same time, which works in favour of a lock-free insertion method but actually works against locking insertion, because now all threads compete for the same few locks. Because CAS is robust in this respect, it should be much faster in the denser and wider variety of scenes that artists want to use (at least on my machine; see the performance fig below).

Maintainability-wise... I do think the current patch is explained pretty well and clearly commented, and a function like this will eventually become a bit more intertwined anyway, so I think it should largely be fine? I do understand that the traditional locking method is easier to understand and modify, yet this one isn't that much more complicated. A small price to pay for a performance increase? 🤔

Or... we could actually add compile options to switch between CAS and locked inserts? That doesn't appear to be that complicated.

Let's not go into defensive mode, but rather help each other understand what exactly is going on. That seems a much better way of moving forward ;)

I am not sure what methodology you used to gather the numbers presented in the screenshot. From a quick chat with Sebastian, it seems the numbers might come from an older state of the code or patch. That is not a reliable way to quantify the performance benefit of this specific patch and the suggestions applied on top of it. The thing is: even with the lock at the node level, the patch is different from the current state of the code in the master branch: the amount of refactoring is big, and it all contributes to the overall performance.

To reiterate something I've mentioned above: you cannot think of CAS as lock-free, especially when you need to implement busy waits in places to ensure thread safety. On an intuitive level, CAS plus a wait is the same as a spin lock.

The main semantic difference of the current patch (the single spin lock with the recursive lock) from the previous CAS state is that in the if (old_tri_index < max_triangles) { branch, lineart_triangle_intersect_in_bounding_area is called outside of any lock, potentially allowing multiple threads to operate. However, there is something I did not mention after the busy wait was added to lineart_bounding_area_triangle_reallocate: the code does not seem thread-safe to me. I think it is still possible for the busy wait in lineart_bounding_area_triangle_reallocate to "see" that all array elements are non-null and start re-allocating while lineart_triangle_intersect_in_bounding_area is attempting to access the array elements. There might be some other condition I'm missing which prevents this from happening.

What I did is make a version of the simple single-spin-lock code that mimics the seemingly-non-thread-safe version by moving BLI_spin_unlock to before lineart_triangle_intersect_in_bounding_area. Again, this seems wrong to me, but I wanted to collect some performance numbers.

So after that I ran a number of tests on different systems to compare the CAS approach, the simple single-lock version (the current state of the patch), and the modification of the current patch which mimics the older locking better. The full test description and results are below.

Test description:

  • Command run: ./bin/blender --debug-value 4000 | grep "Line art intersection time"
  • Open mr_elephant_LANPR.blend
  • Hit F12
  • Two timing values are printed (viewport and F12, I'd imagine). Both numbers are put into the table.

Test case 1: diff 52059
The previous version of the D14953 which has CAS and busy waits. This is an arc patch --diff 52059 applied on top of rBd3b3d723037.

Test case 2: diff 52139
This is the current version of the D14953. No CAS, single lock, with fixes from Yiming.

Test case 3: diff 52139 + early unlock
Same as above, but with the unlock happening prior to the lineart_triangle_intersect_in_bounding_area.

Test machines:

  • MacPro with 3.2 GHz 16-Core Intel Xeon W, 16 core 32 threads, called Xeon W in the table
  • 11th Gen Intel Core i9-11900K @ 3.5GHz, 8 core 16 threads, called i9-11900K in the table
            diff 52059          diff 52139          diff 52139 + early unlock
Xeon W      0.876426 0.341633   0.840134 0.312579   0.808048 0.327174
i9-11900K   0.952197 0.269997   0.962816 0.269146   0.931403 0.263701

Keep in mind: allow for a ~1% noise floor, as consecutive runs of the same case will yield slightly different numbers.

From this quick test it seems that the current state of the patch is better than the previous CAS version (fewer busy waits, fewer synchronizations) and is more clearly thread-safe. If the early unlock prior to lineart_triangle_intersect_in_bounding_area can be proven thread-safe, we can get a bit more speedup.

My suggestion would be to move on with the current state of the patch. It is a huge improvement compared to what we currently have in the master branch, and the threading model/synchronization is clear.

If someone can measure a more-than-a-few-percent slowdown of the "simple" version of the patch compared to CAS, I'd say two things. First, make sure that the call to lineart_triangle_intersect_in_bounding_area in the CAS patch is thread-safe. Second, submit a separate patch on top of an updated master branch, as we will have the current state of this patch committed ;)

I'm marking the patch as accepted, as its current state is something I understand, brings a good speedup, and behaves well on the 2 systems I have access to. And, again, I really think it is something we can put into the master branch; if further tweaks are desired, they can/should be done as separate patches (it is not great to leave artists without a huge speedup while we try to squeeze out a couple more percent ;)

This revision is now accepted and ready to land.Jun 2 2022, 12:38 PM

OK, I think I understand now. The main difference I've seen between the locking and CAS algorithms is probably due to task scheduling behavior that I don't fully understand, and I guess my machine (8 threads) gets it a tad faster because the thread count is lower, so the memory fence doesn't disrupt the overall execution as much.

About putting BLI_spin_unlock before the lineart_triangle_intersect_in_bounding_area call: because that call is given an "up_to" index, and this time all triangles are inserted sequentially, it should technically work, but it does mean the reallocation is unsafe. However, I think the mr_elephant file never actually reaches the deepest level where reallocation is needed, which is why I believe it did not crash there. One thing we could do is change the reallocation to extend without freeing the old arrays, and free them later (a scene with a reasonable density of geometry and without stacks of triangles overlapping one exact point will likely never reach that depth, and even then we would only reallocate very few of those arrays). Then it would not cause much of a problem, and we could put the unlock before the intersection call.

I'll clean up the comment in the code and patch description to reflect the chosen implementation.

Ah, indeed the patch description needs to be updated.

YimingWu (NicksBest) edited the summary of this revision. (Show Details)Jun 2 2022, 2:45 PM
YimingWu (NicksBest) edited the summary of this revision. (Show Details)