Page MenuHome

GPU: Thread safe index buffer builders.
ClosedPublic

Authored by Jeroen Bakker (jbakker) on Jun 4 2021, 12:59 PM.

Details

Summary

Current index builder is designed to be used in a single thread.
This makes all index buffer extractions single threaded.
This patch adds a thread safe solution enabling multithreaded
building of index buffers.

To reduce locking the solution would provide a task/thread local
index buffer builder (called sub builder).
When a thread is finished this thread local index buffer builder
can be joined with the initial index buffer builder.

GPU module impact

GPU_indexbuf_subbuilder_init: Initialized a sub builder. The
index list is shared between the parent and sub buffer, but the
counters are localized. Ensuring that updating counters would
not need any locking.

GPU_indexbuf_subbuilder_finish: merge the information of the
sub builder back to the parent builder. Needs to be invoked outside
the worker thread, or when sure that all worker threads have been
finished. Internal the function is not thread safe.

Mesh points extraction

For testing purposes the extract_points extractor has been migrated to
the new API. Herefore changes to the mesh extractor were needed.

  • When creating tasks, the task number of current task is stored in ExtractTaskData including the total number of tasks.
  • Adding two functions in MeshExtract.
    • task_init will initialize the task specific userdata.
    • task_finish should merge back the task specific userdata back.
  • adding task_id parameter to the iteration functions so they can access the correct task data without any need for locking.

Before the points are extracted single threaded.

After the change the points aren't part of the single threaded extraction.

There is no noticeable change in end user performance.

Diff Detail

Repository
rB Blender
Branch
T88822 (branched from master)
Build Status
Buildable 14963
Build 14963: arc lint + arc unit

Event Timeline

Jeroen Bakker (jbakker) requested review of this revision.Jun 4 2021, 12:59 PM
Jeroen Bakker (jbakker) created this revision.

It would be nice to see the impact of these changes on current multithreaded extractors.
The idea is good and easy to understand. But does the readability improvement justify having to select a task userdata in each callback? This seems to be a bit time-consuming.
I'll run some tests.

source/blender/draw/intern/draw_cache_extract_mesh.cc
720

Apparently here you could put "taskdata->task_id = atomic_fetch_and_add_int32(taskdata->task_counter, 1);" to simplify the code.

source/blender/draw/intern/draw_cache_extract_mesh_extractors.c
784

I'm not sure adding parameters to callbacks is a good alternative.
They consume registers/stack and may slow things down a bit.
But we can only be sure after the benchmark.

source/blender/draw/intern/mesh_extractors/extract_mesh_ibo_points.cc
179

Error C7555: use of designated initializers requires at least '/std:c++latest'

Germano Cavalcante (mano-wii) requested changes to this revision.Jun 4 2021, 2:35 PM

I'm having problems with this file:

This revision now requires changes to proceed.Jun 4 2021, 2:35 PM
source/blender/draw/intern/draw_cache_extract_mesh.cc
720

Yes, nice one!
it will remove some complexity

source/blender/draw/intern/draw_cache_extract_mesh_extractors.c
784

In a cleanup I want to put all parameters in a struct so it doesn't add more space on the stack. As these functions are callbacks no register optimizations is done for the callbacks. Everything goes via the stack.
My proposal would be to do this after we accept an approach and land it.

  • Generate task_id from task_counter.
  • Cleanup: remove extract_extract_iter_task_data_create_mesh.
  • Cleanup: rename extract_run_and_finish_init > extract_run_single_threaded.
  • Fix crash when extracting multiple iterators.
  • Fix compile error on windows.

I'm still seeing issues in the attached file:

Jeroen Bakker (jbakker) planned changes to this revision.Jun 5 2021, 8:10 AM

Reduce complexity by moving responsibility to the scheduler. Reduce number of unneeded taskinits and finished

  • Keep track of task user data in scheduled.
Jeroen Bakker (jbakker) edited the summary of this revision. (Show Details)Jun 7 2021, 12:01 PM
Jeroen Bakker (jbakker) marked an inline comment as done.
  • Removed unneeded changes.
  • Merge branch 'master' into temp-T88822-gpu-thread-safe-index-builder
  • Enabled multithreading for ibo.points.
  • Merge branch 'master' into temp-T88822-gpu-thread-safe-index-builder
  • Remove unneeded enter.

I will do some single thread benchmark tests to evaluate the overhead.
In the meantime here some remarks:

source/blender/draw/intern/draw_cache_extract_mesh.cc
980

task_len is a bit misleading. Wouldn't it be better task_len_max?
Also all tasks are distributed among the number of available threads, (including each iter_type in parallel).

A closer value would be:

const int task_len = num_threads + multi_threaded_extractors->iter_types_len();
source/blender/draw/intern/mesh_extractors/extract_mesh_ibo_points.cc
181

Where is use_threading set now?

source/blender/draw/intern/mesh_extractors/extract_mesh_ibo_points.cc
181

Maybe I didn't refresh the page.

Jeroen Bakker (jbakker) marked 2 inline comments as done.Jun 7 2021, 2:04 PM
Jeroen Bakker (jbakker) added inline comments.
source/blender/draw/intern/mesh_extractors/extract_mesh_ibo_points.cc
181

Maybe I didn't refresh the page.

181

Was removed during a merge by mistake

source/blender/draw/intern/draw_cache_extract_mesh.cc
980

Yes this is a worst case implementation. Would rather have something better, not sure your proposal is sufficient.

At max extract_task_in_ranges_create can create num_threads tasks per iter_type. As this count isn't available at this time I used a worst case implementation (num_threads * iter_types_len). We could make it more accurate to move the task_len determination to extract_task_in_ranges_create.

source/blender/draw/intern/draw_cache_extract_mesh.cc
980

This area needs some cleaning, but from what I can see, multiplying num_threads by multi_threaded_extractors->iter_types_len() is too much.

See extract_range_task_chunk_size_get. There, the total number of elements (looptris + polys + ledges + lverts) is calculated and then the result is divided by num_threads to get a range that distributes them all.
However, as the number of each element is not perfectly divisible. There can be a remainder for each iter.

But the value doesn't stray too far from num_threads.

Ideally I would prefer not to calculate the number of threads and let the driver distribute the ranges with parallel_for.

  • Addressed code review comments
  • Fix issue with recent merge.
Jeroen Bakker (jbakker) marked 2 inline comments as done.Jun 7 2021, 3:15 PM
Jeroen Bakker (jbakker) added inline comments.
source/blender/draw/intern/draw_cache_extract_mesh.cc
980

Par iter should be investigated as it might add a second level of parameter unpacking. But don't think the extra level would impact it too much.

Need #include <optional> to compile.

Benchmark
I compared single thread performance to evaluate overhead. Apparently the difference is negligible, and the result was even better in one of the tests.

master:PATCH:
large_mesh_editing:Average: 8.188309 FPSAverage: 8.146775 FPS
rdata 8ms iter 32ms (frame 122ms)rdata 9ms iter 33ms (frame 124ms)
large_mesh_editing_ledge:Average: 13.879269 FPSAverage: 13.737461 FPS
rdata 8ms iter 36ms (frame 72ms)rdata 8ms iter 36ms (frame 73ms)
looptris_test:Average: 4.020419 FPSAverage: 4.093891 FPS
rdata 11ms iter 88ms (frame 234ms)rdata 11ms iter 86ms (frame 230ms)
subdiv_mesh_cage_and_final:Average: 1.916560 FPSAverage: 1.924628 FPS
rdata 7ms iter 37ms (frame 263ms)rdata 7ms iter 39ms (frame 262ms)
rdata 7ms iter 39ms (frame 254ms)rdata 7ms iter 41ms (frame 254ms)
subdiv_mesh_final_only:Average: 6.463499 FPSAverage: 6.462770 FPS
rdata 3ms iter 20ms (frame 152ms)rdata 3ms iter 21ms (frame 149ms)
subdiv_mesh_final_only_ledge:Average: 6.389024 FPSAverage: 6.635199 FPS
rdata 3ms iter 20ms (frame 154ms)rdata 3ms iter 21ms (frame 145ms)

The difference may be more due to the new paging than CPU computation.

Some of the code has a style that could perhaps be more readable.
Part of the awkwardness was the use of void *&operator[](const TaskId task_id) in ExtractorRunData whose name already looks a lot like the vector ExtractorRunDatas.
(But maybe it's because I'm still not so used to using C++ in this way in Blender).

This patch really solves the multithreading issue with GPUIndexBuf, so it's worth the implementation.

Just remember to add the missing include before landing.

This revision is now accepted and ready to land.Jun 7 2021, 6:48 PM
Jeroen Bakker (jbakker) marked an inline comment as done.
  • Merge branch 'master' into arcpatch-D11499
This revision was automatically updated to reflect the committed changes.