Page MenuHome

Geometry Node: Parallel safe disjoins find_root
AbandonedPublic

Authored by Iliya Katueshenock (Moder) on Nov 12 2022, 3:46 AM.

Details

Summary

Geometry Node: Parallel safe disjoins find_root

The security of the search is explained by the fact that regardless of
the rewriting of the shortest path to the root, the algorithm always
works the same way. This means that even in the case of a write race
and reading, no matter how atamarically written (rewriting types with
a machine word) new number appears there.


Root lookup is thread safe. Depending on the distribution of
the order and the length of the links, parallelism can give from
2% to 50% increase (only for searching and recording the group
number) of the running time. It would be difficult for tests to
cover all variants, so in most cases it will be something in the region
of 10% (as the most favorable forecast)

Diff Detail

Event Timeline

Jacques Lucke (JacquesLucke) requested changes to this revision.Nov 12 2022, 11:07 AM

For actual thread safety, atomics have to be used for anything that might be written to and read from at the same time. It should be enough to use std::memory_order_relaxed in all cases here though.

This revision now requires changes to proceed.Nov 12 2022, 11:07 AM

While I was looking for information for one of the past patches, I found out that for machine word types, record atamarity is guaranteed
But I now managed to understand that for some architectures int64_t or more can consist of a couple of words
And in this case, this security can in rare cases leak
But it seemed to me that it would be expensive to use atamarity, so I will try to experiment with this

I realized that bypassing direct parallelism, you break everything into parallel chunks and just shorten the path to the root, but not solve it.
In this case, all that is needed is to have an ordered path to the root. Instead of rank, can make all connections in index-order dirrection.
In this example, I'm using the origin direction. That is, each element at the beginning of the list can correspond to several elements with a higher index.
Due to this, passing through all the elements in a loop, can sequentially propagate all local roots.
As a defense, I check if the element that is previous in the list belongs to the current chunk. If it is, then it has already been processed before and contains the shortest path to the root of the chunk. If it does not belong to a chunk, then the index itself, the root of the chunk, will count.
The subsequent search can bypass all chunks at most 1 time. Or, if you're lucky, go through only 1.


1diff --git a/source/blender/blenlib/BLI_disjoint_set.hh b/source/blender/blenlib/BLI_disjoint_set.hh
2index 3adc609dad3..fdee7c9b956 100644
3--- a/source/blender/blenlib/BLI_disjoint_set.hh
4+++ b/source/blender/blenlib/BLI_disjoint_set.hh
5@@ -10,6 +10,7 @@
6
7 #include "BLI_array.hh"
8 #include "BLI_index_range.hh"
9+#include "BLI_task.hh"
10
11 namespace blender {
12
13@@ -54,6 +55,23 @@ template<typename T = int64_t> class DisjointSet {
14 ranks_[root1]++;
15 }
16 }
17+
18+ void join_fast(const T x, const T y)
19+ {
20+ T root1 = x;//this->find_root(x);
21+ T root2 = y;//this->find_root(y);
22+
23+ /* x and y are in the same set already. */
24+ if (root1 == root2) {
25+ return;
26+ }
27+
28+ /* Implement union by order. */
29+ if (parents_[root1] < parents_[root2]) {
30+ std::swap(root1, root2);
31+ }
32+ parents_[root1] = parents_[root2];
33+ }
34
35 /**
36 * Return true when x and y are in the same set.
37@@ -65,6 +83,19 @@ template<typename T = int64_t> class DisjointSet {
38 return root1 == root2;
39 }
40
41+ void root_prepare(const int grain_size)
42+ {
43+ threading::parallel_for(parents_.index_range(), grain_size, [&](const IndexRange range) {
44+ for (const int64_t i : range){
45+ if (parents_[i] != i){
46+ if (range.contains(parents_[i])){
47+ parents_[i] = parents_[parents_[i]];
48+ }
49+ }
50+ }
51+ });
52+ }
53+
54 /**
55 * Find the element that represents the set containing x currently.
56 */
57diff --git a/source/blender/nodes/geometry/nodes/node_geo_input_mesh_island.cc b/source/blender/nodes/geometry/nodes/node_geo_input_mesh_island.cc
58index d658b14092a..1be01235bd1 100644
59--- a/source/blender/nodes/geometry/nodes/node_geo_input_mesh_island.cc
60+++ b/source/blender/nodes/geometry/nodes/node_geo_input_mesh_island.cc
61@@ -6,6 +6,8 @@
62 #include "BKE_mesh.h"
63
64 #include "BLI_disjoint_set.hh"
65+#include "BLI_task.hh"
66+#include "BLI_timeit.hh"
67
68 #include "node_geometry_util.hh"
69
70@@ -35,16 +37,51 @@ class IslandFieldInput final : public bke::MeshFieldInput {
71 {
72 const Span<MEdge> edges = mesh.edges();
73
74- DisjointSet<int> islands(mesh.totvert);
75- for (const int i : edges.index_range()) {
76- islands.join(edges[i].v1, edges[i].v2);
77+ {
78+ SCOPED_TIMER("Fast");
79+ DisjointSet<int> islands(mesh.totvert);
80+ {
81+ SCOPED_TIMER("Fast Join");
82+ for (const int i : edges.index_range()) {
83+ islands.join_fast(edges[i].v1, edges[i].v2);
84+ }
85+ }
86+
87+ {
88+ SCOPED_TIMER("Fast Prepare");
89+ islands.root_prepare(1024);
90+ }
91+
92+ Array<int> output(mesh.totvert);
93+ VectorSet<int> ordered_roots;
94+ {
95+ SCOPED_TIMER("Fast Find");
96+ for (const int i : IndexRange(mesh.totvert)) {
97+ const int root = islands.find_root(i);
98+ output[i] = ordered_roots.index_of_or_add(root);
99+ }
100+ }
101 }
102-
103- Array<int> output(mesh.totvert);
104- VectorSet<int> ordered_roots;
105- for (const int i : IndexRange(mesh.totvert)) {
106- const int root = islands.find_root(i);
107- output[i] = ordered_roots.index_of_or_add(root);
108+
109+ Array<int> output(mesh.totvert);
110+ {
111+ SCOPED_TIMER("Simple");
112+ DisjointSet<int> islands(mesh.totvert);
113+ {
114+ SCOPED_TIMER("Simple Join");
115+ for (const int i : edges.index_range()) {
116+ islands.join(edges[i].v1, edges[i].v2);
117+ }
118+ }
119+
120+ VectorSet<int> ordered_roots;
121+ {
122+ SCOPED_TIMER("Simple Find");
123+ for (const int i : IndexRange(mesh.totvert)) {
124+ const int root = islands.find_root(i);
125+ output[i] = ordered_roots.index_of_or_add(root);
126+ }
127+ }
128 }
129
130 return mesh.attributes().adapt_domain<int>(


1Timer 'Fast Join' took 163.5 ms
2Timer 'Fast Prepare' took 1.8 ms
3Timer 'Fast Find' took 62.7 ms
4Timer 'Fast' took 260.5 ms
5Timer 'Simple Join' took 205.6 ms
6Timer 'Simple Find' took 41.2 ms
7Timer 'Simple' took 269.6 ms
8Timer 'Fast Join' took 289.7 ms
9Timer 'Fast Prepare' took 3.6 ms
10Timer 'Fast Find' took 94.8 ms
11Timer 'Fast' took 455.0 ms
12Timer 'Simple Join' took 383.0 ms
13Timer 'Simple Find' took 59.6 ms
14Timer 'Simple' took 484.5 ms
15Timer 'Fast Join' took 39.1 ms
16Timer 'Fast Prepare' took 0.5 ms
17Timer 'Fast Find' took 15.4 ms
18Timer 'Fast' took 1326.3 ms
19Timer 'Simple Join' took 53.2 ms
20Timer 'Simple Find' took 10.1 ms
21Timer 'Simple' took 68.4 ms
22Timer 'Fast Join' took 34.2 ms
23Timer 'Fast Prepare' took 0.4 ms
24Timer 'Fast Find' took 15.8 ms
25Timer 'Fast' took 56.9 ms
26Timer 'Simple Join' took 53.1 ms
27Timer 'Simple Find' took 9.9 ms
28Timer 'Simple' took 69.2 ms
29Timer 'Fast Join' took 2.6 ms
30Timer 'Fast Prepare' took 0.1 ms
31Timer 'Fast Find' took 1.8 ms
32Timer 'Fast' took 6.3 ms
33Timer 'Simple Join' took 4.5 ms
34Timer 'Simple Find' took 1.3 ms
35Timer 'Simple' took 6.5 ms
36Timer 'Fast Join' took 5.2 ms
37Timer 'Fast Prepare' took 0.2 ms
38Timer 'Fast Find' took 2.4 ms
39Timer 'Fast' took 9.6 ms
40Timer 'Simple Join' took 6.4 ms
41Timer 'Simple Find' took 1.9 ms
42Timer 'Simple' took 9.3 ms
43Timer 'Fast Join' took 11.1 ms
44Timer 'Fast Prepare' took 0.2 ms
45Timer 'Fast Find' took 5.0 ms
46Timer 'Fast' took 20.5 ms
47Timer 'Simple Join' took 14.3 ms
48Timer 'Simple Find' took 3.6 ms
49Timer 'Simple' took 19.9 ms
50Timer 'Fast Join' took 19.0 ms
51Timer 'Fast Prepare' took 0.4 ms
52Timer 'Fast Find' took 6.4 ms
53Timer 'Fast' took 31.9 ms
54Timer 'Simple Join' took 27.5 ms
55Timer 'Simple Find' took 6.9 ms
56Timer 'Simple' took 38.7 ms
57Timer 'Fast Join' took 39.1 ms
58Timer 'Fast Prepare' took 0.6 ms
59Timer 'Fast Find' took 12.6 ms
60Timer 'Fast' took 65.9 ms
61Timer 'Simple Join' took 62.7 ms
62Timer 'Simple Find' took 11.2 ms
63Timer 'Simple' took 82.8 ms
64Info: Saved "distjoins island index.blend"
65
66Timer 'Fast Join' took 56.6 ms
67Timer 'Fast Prepare' took 0.7 ms
68Timer 'Fast Find' took 27.5 ms
69Timer 'Fast' took 99.4 ms
70Timer 'Simple Join' took 79.1 ms
71Timer 'Simple Find' took 17.9 ms
72Timer 'Simple' took 107.9 ms


  • It's kind of important that a heuristic like "union by rank" is used instead of your "union by order", because that can have a high impact on worst case performance. For some details see: https://en.wikipedia.org/wiki/Disjoint-set_data_structure#Merging_two_sets.
  • I don't quite understand the purpose of root_prepare. It looks like it can improve the quality of the tree a little bit, but doesn't really change anything overall in terms of threading correctness.

Using atomics might be ok here actually. Elements are written to relatively rarely, and on x86/64 the writes would be equivalent to non-atomic writes (with relaxed memory order). It might be benefitial to use only one array for parents_ and ranks_ instead of two, to reduce the likelyhood of false sharing.

As I understand it, the article compares only methods for creating weight, namely rank and total.
My idea was to represent the tree as a flat structure, directed from the very beginning (where the root is) to the end (the last elements)
Now, if we divide this tree into floors by height, then within each floor all branches can be represented as a root and its branches. That is, the subsequent search will require a traversal from one element to the number of floors.
The more floors, the shorter the search. The more floors, the more parallelism.

Yes, I understand, i still have to use atomic (Found and painted over a white spot in my understanding of the problem)

Iliya Katueshenock (Moder) retitled this revision from Geometry Node: Parallel safe distjoins find_root to Geometry Node: Parallel safe disjoins find_root.Nov 21 2022, 12:21 PM
Iliya Katueshenock (Moder) edited the summary of this revision. (Show Details)

All my tests could not even come close to giving such an increase as in D16653: BLI: Add atomic disjoint set data structure.