This task is related to work done on our GHash in [temp-ghash-experiments](https://developer.blender.org/diffusion/B/browse/temp-ghash-experiments/) branch.
Reducing Load
===========
Here are the results of some tests, run using [this corpus](http://corpora.uni-leipzig.de/downloads/eng_wikipedia_2010_1M-text.tar.gz) (1 million words, about 122MB of text), from [corpora.informatik.uni-leipzig.de](http://corpora.informatik.uni-leipzig.de/download.html).
Inserted keys were the whole text, all of its lines, and all of its words, for a total of 2 917 605 entries.
| Pre-reservation of buckets | Hash | LoadT | Type | Insert ms| Lookup ms | Quality | Variance | Load % | Empty % | Overloaded %
| -------------------------------- | ---- | ----- | ---- | ------ | ------ | ------- | -------- | ------ | ------- | ------------
| Strings (1M words, 122M file) | GH | 0.75 | Mask | 2.556978 | 3.036947 | 1.002262 | 0.349653 | 34.78 | 70.68 | 4.84 (7)
| Strings (1M words, 122M file) | GH | 0.75 | Mod | 2.741920 | 3.156516 | 1.002311 | 0.349693 | 34.78 | 70.68 | 4.85 (7)
| Strings (1M words, 122M file) | GH | 3 | Mask | 3.567847 | 4.707220 | 1.002759 | 1.404238 | 139.12 | 25.03 | 5.36 (10)
| Strings (1M words, 122M file) | GH | 3 | Mod | 4.116788 | 5.141498 | 1.001088 | 1.396343 | 139.12 | 24.94 | 5.29 (11)
| Strings (1M words, 122M file) | MM2 | 0.75 | Mask | 2.836161 | 3.226247 | 1.000052 | 0.347848 | 34.78 | 70.63 | 4.82 (6)
| Strings (1M words, 122M file) | MM2 | 0.75 | Mod | 3.083521 | 3.520704 | 0.999960 | 0.347772 | 34.78 | 70.62 | 4.81 (7)
| Strings (1M words, 122M file) | MM2 | 3 | Mask | 3.807093 | 4.904332 | 0.999995 | 1.391200 | 139.12 | 24.89 | 5.27 (11)
| Strings (1M words, 122M file) | MM2 | 3 | Mod | 4.198665 | 5.395439 | 0.999922 | 1.390842 | 139.12 | 24.89 | 2.28 (9)
| No pre-reservation of buckets | Hash | LoadT | Type | Insert ms| Lookup ms | Quality | Variance | Load % | Empty % | Overloaded %
| -------------------------------- | ---- | ----- | ---- | ------ | ------ | ------- | -------- | ------ | ------- | ------------
| Strings (1M words, 122M file) | GH | 0.75 | Mask | 6.177619 | 3.579811 | 1.001687 | 0.698775 | 69.56 | 49.95 | 15.45 (8)
| Strings (1M words, 122M file) | GH | 0.75 | Mod | 6.670569 | 4.010712 | 1.002150 | 0.699641 | 69.56 | 49.96 | 15.43 (8)
| Strings (1M words, 122M file) | GH | 3 | Mask | 7.459112 | 5.181792 | 1.002515 | 2.815917 | 278.24 | 6.29 | 30.41 (13)
| Strings (1M words, 122M file) | GH | 3 | Mod | 7.634238 | 5.366527 | 1.000709 | 2.791862 | 278.24 | 6.23 | 30.45 (14)
| Strings (1M words, 122M file) | MM2 | 0.75 | Mask | 5.133478 | 3.935534 | 1.000095 | 0.695790 | 69.56 | 49.88 | 15.43 (9)
| Strings (1M words, 122M file) | MM2 | 0.75 | Mod | 5.434215 | 4.090286 | 0.999919 | 0.695456 | 69.56 | 49.88 | 15.44 (8)
| Strings (1M words, 122M file) | MM2 | 3 | Mask | 6.287116 | 5.402290 | 1.000289 | 2.786288 | 278.24 | 6.20 | 30.43 (15)
| Strings (1M words, 122M file) | MM2 | 3 | Mod | 6.577213 | 5.520325 | 0.999844 | 2.780355 | 278.24 | 6.17 | 30.47 (16)
Notes:
* //GH// stands for the standard GHash hashing function, //MM2// for the Murmur2 one.
* //LoadT// is the 'load threshold', i.e. the maximum average number of entries per bucket before the buckets are resized.
* //Overloaded %// is the percentage of buckets holding more entries than max(1, LoadT) (i.e. two or more for 0.75, four or more for 3).
* We ran the same tests using the first 100M integers, and 10M random vectors of four integers, as keys; results were broadly similar to those for strings.
From that we can conclude (at least for now) that:
* Our current load threshold (3) is a real issue! Insertion **and** lookup are several tens of percent slower than with a threshold of 0.75 - the memory saving is really not worth it here!
* Murmur2 hashing gives //slightly// better distribution than the GHash one, but the performance tradeoff is usually not worth it (the only exception would be long-enough strings, where it becomes faster than our current char-looping hashing).
* Modulo gives an even smaller improvement over masking, and is again noticeably slower.
Also, we can see how important it is to pre-allocate (reserve) GHash buckets whenever possible: resizing buckets is a very expensive operation. We can also see how worthwhile it can be to over-reserve buckets for better lookup times (see the differences in lookup times between the different load percentages).
----------
Storing Hashes
===========
And now, here are some results with hashes stored in entries:
| Pre-reservation of buckets | Hash | LoadT | Type | Insert ms| Lookup ms | Quality | Variance | Load % | Empty % | Overloaded %
| -------------------------------- | ---- | ----- | ---- | ------ | ------ | ------- | -------- | ------ | ------- | ------------
| Strings (1M words, 122M file) | GH | 0.75 | Mask | 2.182862 | 2.383782 | 1.002262 | 0.349653 | 34.78 | 70.68 | 4.84 (7)
| Strings (1M words, 122M file) | GH | 0.75 | Mod | 2.438983 | 2.601138 | 1.002311 | 0.349693 | 34.78 | 70.68 | 4.85 (7)
| Strings (1M words, 122M file) | MM2 | 0.75 | Mask | 2.439041 | 2.674447 | 1.000052 | 0.347848 | 34.78 | 70.63 | 4.82 (6)
| Strings (1M words, 122M file) | MM2 | 0.75 | Mod | 2.624448 | 2.830084 | 0.999960 | 0.347772 | 34.78 | 70.62 | 4.81 (7)
| No pre-reservation of buckets | Hash | LoadT | Type | Insert ms| Lookup ms | Quality | Variance | Load % | Empty % | Overloaded %
| -------------------------------- | ---- | ----- | ---- | ------ | ------ | ------- | -------- | ------ | ------- | ------------
| Strings (1M words, 122M file) | GH | 0.75 | Mask | 2.475735 | 2.459534 | 1.001687 | 0.698775 | 69.56 | 49.95 | 15.45 (8)
| Strings (1M words, 122M file) | GH | 0.75 | Mod | 2.743342 | 2.621367 | 1.002150 | 0.699641 | 69.56 | 49.96 | 15.43 (8)
| Strings (1M words, 122M file) | MM2 | 0.75 | Mask | 2.825784 | 2.770305 | 1.000095 | 0.695790 | 69.56 | 49.88 | 15.43 (9)
| Strings (1M words, 122M file) | MM2 | 0.75 | Mod | 3.047072 | 2.975727 | 0.999919 | 0.695456 | 69.56 | 49.88 | 15.44 (8)
The benefits are obvious: non-reserved times are now even quite close to reserved ones!
----------
About memory
===========
Here are some (rough) estimations for 100M entries:
| Memory usage for 100M entries | GHash | | |GSet | |
| ------------------------------- | ------- | ------- | ----- | ------- | ------- | -----
| Memory (in MB) | Buckets | Entries | Total | Buckets | Entries | Total
| ------------------------------- | ------- | ------- | ----- | ------- | ------- | -----
| Current Master (load: 3) | 268 | 2400 | 2668 | 268 | 1600 | 1868
| Load: 0.75 | 1074 | 2400 | 3474 | 1074 | 1600 | 2674
| Load: 0.75, storing hashes | 1074 | 3200 | 4274 | 1074 | 2400 | 3474
In other words, all proposed changes together approximately double the memory footprint of GHash/GSet.
But our typical usage in Blender almost never involves more than a few million entries
(usually much less, actually), so I would not consider that an issue?