Use multithreading in add_triangles to speed up quad tree building. Each thread locks the tile it is working on for triangle insertion, intersection calculation, tile splitting, and triangle array extension.
Originally this patch used atomic "compare and swap" (CAS) to get around the lock and insert triangles into the quad tree in a shorter time. However, after extensive trials and analysis, we found that a more traditional locking approach behaves better in terms of thread scheduling than trying to avoid locks with CAS, and that a CAS with a "busy wait" is not really a lock-free algorithm. So we added the lock back, and based on the tests that Sergey did, it has proven to be more robust and reasonably scalable. The overall logic and code path are the same as the CAS method described below; we just decided not to use CAS to synchronize thread data.
Further optimization would need an overhaul of the 2D acceleration structure, which may include approaches such as a fixed grid. This is left for future experimentation.
Old information on the CAS algorithm, kept for context
Using the atomic "compare and swap" method in the add_triangles stage dramatically speeds up the add_triangles call and significantly reduces the overall calculation time. This is the fastest method we have experimented with so far.
Note: Needs D15020 to compile!
This performance graph was generated on a machine with the following setup:
- CPU: Intel 4910MQ (8 threads)
- RAM: 32GB DDR3L
- Xubuntu 20.04 LTS
How it works
Lineart needs to build a 2D quad tree and link triangles into it so that occlusion/intersection can be calculated efficiently without going through a lot of iterations, similar to what a BVH does in 3D. LineartBoundingArea is the 2D quad tree node.
Originally Lineart used a single thread to build this quad tree. In temp-lineart-contained it actually used a per-node lock to achieve multithreaded insertion, but locking the node heavily can be inefficient when multiple threads are inserting into the same tile. After researching different methods, @Sebastian Parborg (zeddb) and I found that the fastest way currently (on our machines) is to use atomic compare and swap to handle inserting. This way we spend far less time in locks: we only lock a specific tile when we need to split it or when its triangle array is full. The per-node lock algorithm was never integrated into master.
To briefly describe how the quad tree is built:
- There will be an initial tile grid of roughly 10x10. Each tile has a limit of 100 triangles; beyond that it needs to split into 4 child tiles, or (if at max depth) extend the array.
- Each thread has a batch of triangles to be inserted into the tiles (lineart_bounding_area_link_triangle_cas).
- When inserting into a tile, use CAS (compare and swap) to get a unique index and assign the triangle into the array.
- If the tile is full, lock the tile and split it (or extend the array); when done, release the lock.
- After inserting a triangle, the thread also tests this triangle against the triangles inserted before it for intersections.
- Repeat until all triangles are inserted.
- Generate intersection edges from the per-thread intersection results.
- Relink each tile with its adjacent tiles for the convenience of edge marching in the later occlusion stage.
Here is a logic flow showing how lineart_bounding_area_link_triangle_cas works under the hood, to make it easier to get a general idea of what the insertion function is doing.
Other stuff
Additional changes include moving some auxiliary calls (like relinking adjacent tiles) out of the original triangle insertion call, to get maximum performance for triangle insertion itself.
Notes
Notes on the cas and atomic_load/store calls, and why we only use that in specific places:
- The reason for using CAS instead of atomic_add to get an index (for triangle insertion) is that with multiple threads blindly incrementing the index, it could overflow the array length limit. With compare and swap done atomically, a thread only commits the increment after checking the current value against the limit, so the updated value is always the latest and stays in bounds.
- The only two places atomic_load_z is used are where a thread needs to read addresses that are (likely) written by other threads: the tile splitting function (understandably, the tile contains all the triangles that different threads inserted into it), and the tri-tri intersection function (which compares previously added triangles up to the current triangle's own index, a range that also contains triangles inserted by other threads).
- The only place atomic_store_z is used is when a thread writes a triangle address into the linked_triangles array; it needs to make sure the write is visible to all other threads.
Notes on the "busy wait":
- We need to retry atomic_load_z on the linked triangle pointer if it is NULL, because at that point other threads are just finishing up assigning those last few triangles into the array. They will be done very quickly (in the tile-splitting case, at most we only need to wait on the last thread_count indices), so it should not be a big problem, and I think this is a reasonable way to satisfy this quirk of the algorithm.


