This patch introduces a new type of builder for GPUIndexBuf:
GPUIndexBufBuilderAsync, which is designed to work in multithreaded
operations.
It is used the same way as GPUIndexBufBuilder; just call the respective
"_async_" functions.
The solution in this patch is to use thread_local to allocate a
unique pointer per thread.
This pointer is lazily initialized if the thread_id in the global slot
indicated by the builder is different from the thread_local_id of the
builder.
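The lazy per-thread initialization described above can be sketched
roughly as follows. This is only an illustration, not the code in the
patch: the names MAX_SLOTS, ThreadLocalData, AsyncBuilder, local_data
and add_point are hypothetical, and it assumes that each builder owns
one slot index plus a unique non-zero id that is compared against the
id stored in the calling thread's copy of that slot.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

/* Hypothetical names throughout; this is not the GPU module's API. */
constexpr int MAX_SLOTS = 8;

struct ThreadLocalData {
  /* Id of the builder this slot was last set up for; 0 means "never". */
  uint64_t owner_id = 0;
  std::vector<uint32_t> indices;
};

struct AsyncBuilder {
  int slot;    /* Slot index in [0, MAX_SLOTS). */
  uint64_t id; /* Unique non-zero id of this builder instance. */
};

/* Every thread gets its own copy of the slot array. */
static thread_local std::array<ThreadLocalData, MAX_SLOTS> tls_slots;

/* Return the calling thread's data for `builder`, (re)initializing it
 * lazily when the slot still belongs to another builder. */
ThreadLocalData &local_data(const AsyncBuilder &builder)
{
  assert(builder.slot >= 0 && builder.slot < MAX_SLOTS);
  ThreadLocalData &data = tls_slots[builder.slot];
  if (data.owner_id != builder.id) {
    data.indices.clear();
    data.owner_id = builder.id;
  }
  return data;
}

/* Append one index without any locking: only this thread's data is touched. */
void add_point(const AsyncBuilder &builder, uint32_t v)
{
  local_data(builder).indices.push_back(v);
}
```

In this sketch the id comparison on every call is the extra cost paid
relative to a plain builder, and the per-thread results would still
have to be merged when the builder is finished.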
The downside of this solution is that we have a limited number of slots
per thread (8).
However, with the existing extracts and with the possibility of
creating at most two MeshBufferCache in parallel, the maximum number
of GPUIndexBufBuilderAsync used at the same time in Blender is 4.
Profiling
Profiling was done in a single thread to evaluate the overhead of the
new async builder compared to the original.
It is therefore expected to be slower in this case, due to lazy
initialization and to the buffers being virtually mapped (so they can
have an address far from the function).
| Test | master | PATCH SINGLE THREAD |
|---|---|---|
| large_mesh_editing | Average: 6.864108 FPS | Average: 6.687620 FPS |
| | rdata 9ms iter 35ms (frame 146ms) | rdata 9ms iter 37ms (frame 150ms) |
| large_mesh_editing_ledge | Average: 10.356587 FPS | Average: 9.927873 FPS |
| | rdata 9ms iter 38ms (frame 98ms) | rdata 9ms iter 39ms (frame 101ms) |
| looptris_test | Average: 3.502261 FPS | Average: 3.472811 FPS |
| | rdata 12ms iter 94ms (frame 266ms) | rdata 12ms iter 96ms (frame 270ms) |
| subdiv_mesh_cage_and_final | Average: 1.763707 FPS | Average: 1.789758 FPS |
| | rdata 7ms iter 44ms (frame 288ms) | rdata 7ms iter 49ms (frame 282ms) |
| | rdata 7ms iter 47ms (frame 279ms) | rdata 7ms iter 52ms (frame 274ms) |
| subdiv_mesh_final_only | Average: 6.046027 FPS | Average: 5.846583 FPS |
| | rdata 3ms iter 23ms (frame 161ms) | rdata 3ms iter 25ms (frame 170ms) |
| subdiv_mesh_final_only_ledge | Average: 5.931267 FPS | Average: 5.921739 FPS |
| | rdata 3ms iter 23ms (frame 164ms) | rdata 3ms iter 26ms (frame 164ms) |
An overhead of 11% in the worst case seems acceptable, since this
builder is intended for one specific case: use in multithreaded
code.
Alternative solutions wouldn't do much better either.
This patch applies this new API to the extracts:
- extract_lines
- extract_lines_with_lines_loose
- extract_points