
Refactor: Draw Cache: use 'BLI_task_parallel_range'
Status: Closed (Public)

Authored by Germano Cavalcante (mano-wii) on Jun 9 2021, 9:02 PM.

Details

Summary

This is an adaptation of D11488: Refactor: Draw Cache Extract Mesh: Use 'BLI_task_parallel_range'.

A disadvantage of manually setting the iteration ranges per thread is that
we don't know how many threads are already busy in the background, so we
don't know how best to distribute the ranges.
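To make the problem concrete, here is a minimal sketch of the manual approach. It assumes a hypothetical helper `split_ranges` (not Blender code) that cuts the iteration range into one contiguous block per *expected* thread:

```cpp
#include <cassert>
#include <utility>
#include <vector>

/* Hypothetical helper (not Blender code): split [0, total) into `num_threads`
 * contiguous ranges, as a caller must do when it picks per-thread ranges itself.
 * `num_threads` is only a guess at how many workers will actually be free. */
std::vector<std::pair<int, int>> split_ranges(int total, int num_threads)
{
  assert(total >= 0 && num_threads > 0);
  std::vector<std::pair<int, int>> ranges;
  const int base = total / num_threads;
  const int remainder = total % num_threads;
  int start = 0;
  for (int i = 0; i < num_threads; i++) {
    /* The first `remainder` ranges get one extra element. */
    const int len = base + (i < remainder ? 1 : 0);
    ranges.emplace_back(start, start + len);
    start += len;
  }
  return ranges;
}
```

If four ranges are created but only three workers are free, one worker has to run two ranges, so the loop takes about half of the serial time instead of a third. Letting the scheduler split the work removes the need to guess.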

To overcome this limitation, we can use parallel_reduce and let the
underlying task scheduler choose the best distribution of ranges among the threads.
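The idea behind parallel_reduce can be illustrated with a toy C++ model. This is a sketch under assumptions, not Blender's `BLI_task` code: `toy_parallel_reduce`, `identity` and `grain` are hypothetical names. The range is split recursively, each split gets its own copy of a template chunk, and a reduce callback merges two chunks at a time:

```cpp
#include <cassert>
#include <future>

/* Toy model of a parallel range loop with per-task chunks and a reduce step.
 * NOT Blender's BLI_task implementation; all names here are hypothetical.
 * Each split starts from a fresh copy of `identity` (like a duplicated
 * userdata chunk), and `reduce_fn` only touches the two chunks it is given. */
template<typename Chunk, typename IterFn, typename ReduceFn>
void toy_parallel_reduce(int start,
                         int stop,
                         int grain,
                         const Chunk &identity,
                         Chunk &chunk,
                         const IterFn &iter_fn,
                         const ReduceFn &reduce_fn)
{
  assert(grain > 0);
  if (stop - start <= grain) {
    /* Small enough: run these iterations serially into this task's chunk. */
    for (int i = start; i < stop; i++) {
      iter_fn(i, chunk);
    }
    return;
  }
  const int mid = start + (stop - start) / 2;
  Chunk right_chunk = identity; /* Per-task copy of the template chunk. */
  auto right = std::async(std::launch::async, [&]() {
    toy_parallel_reduce(mid, stop, grain, identity, right_chunk, iter_fn, reduce_fn);
  });
  toy_parallel_reduce(start, mid, grain, identity, chunk, iter_fn, reduce_fn);
  right.get();
  /* Join: merge the right half's result into this chunk. */
  reduce_fn(chunk, right_chunk);
}
```

A real scheduler reuses a fixed pool of worker threads instead of starting one per split, and decides on its own how far to split. In Blender the corresponding pieces are, as far as I know, the `userdata_chunk`, `func_reduce` and `func_free` members of `TaskParallelSettings` passed to `BLI_task_parallel_range`.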

This proved to be especially beneficial for computers with few cores.

Benchmarking:
Here are the results on a 4-core laptop:

| Test | Metric | master | PATCH |
| --- | --- | --- | --- |
| large_mesh_editing | Average FPS | 5.203638 | 5.398925 |
| | Timings | rdata 15ms, iter 43ms (frame 193ms) | rdata 14ms, iter 36ms (frame 187ms) |

Here are the results on an 8-core PC:

| Test | Metric | master | PATCH |
| --- | --- | --- | --- |
| large_mesh_editing | Average FPS | 15.267482 | 15.906881 |
| | Timings | rdata 9ms, iter 28ms (frame 65ms) | rdata 9ms, iter 25ms (frame 63ms) |
| large_mesh_editing_ledge | Average FPS | 15.145966 | 15.520474 |
| | Timings | rdata 9ms, iter 29ms (frame 65ms) | rdata 9ms, iter 25ms (frame 64ms) |
| looptris_test | Average FPS | 4.001917 | 4.061105 |
| | Timings | rdata 12ms, iter 90ms (frame 236ms) | rdata 12ms, iter 87ms (frame 230ms) |
| subdiv_mesh_cage_and_final | Average FPS | 1.917769 | 1.971790 |
| | Timings | rdata 7ms, iter 37ms (frame 261ms) | rdata 7ms, iter 31ms (frame 258ms) |
| | Timings | rdata 7ms, iter 38ms (frame 252ms) | rdata 7ms, iter 33ms (frame 249ms) |
| subdiv_mesh_final_only | Average FPS | 6.387240 | 6.591251 |
| | Timings | rdata 3ms, iter 25ms (frame 151ms) | rdata 3ms, iter 16ms (frame 145ms) |
| subdiv_mesh_final_only_ledge | Average FPS | 6.247393 | 6.596024 |
| | Timings | rdata 3ms, iter 26ms (frame 158ms) | rdata 3ms, iter 16ms (frame 148ms) |
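The benchmark numbers are easier to compare as relative changes. The helper below is hypothetical and only does the arithmetic. Applied to the 4-core run, average FPS goes from 5.203638 to 5.398925 (about +3.75%), while `iter` drops from 43 ms to 36 ms (about -16.3%). The frame time drops by only 6 ms (193 to 187 ms), roughly the time saved in `iter`, because iteration is just one part of the frame.

```cpp
#include <cassert>

/* Hypothetical helper: relative change in percent from `before` to `after`.
 * Positive means `after` is larger (e.g. more FPS), negative means smaller
 * (e.g. less time spent in `iter`). */
double percent_change(double before, double after)
{
  assert(before != 0.0);
  return (after - before) / before * 100.0;
}
```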

Notes:

  • The improvement can only be noticed if all extracts are multithreaded.
  • This patch touches several areas of the code, so it can be split into separate patches if the idea is accepted.

These screenshots show how threads behave on a quad-core CPU:
[Screenshot: Master]
[Screenshot: Patch]
Diff Detail

Repository: rB Blender
Branch: arcpatch-D11558 (branched from master)
Build Status: Buildable 15104
Build 15104: arc lint + arc unit

Event Timeline

Germano Cavalcante (mano-wii) requested review of this revision. Jun 9 2021, 9:02 PM
Germano Cavalcante (mano-wii) created this revision.

Using GPU_indexbuf_subbuilder_finish in parallel_reduce's free callback did not appear to be thread-safe.
So:

  • Redo the GPUIndexBuf multithread API
  • Undo changes to BLI_task
  • Use the reduce callback to join chunks

My main concern is that this seems to limit further optimizations of extract_tris, where the task data will be different per thread and cannot be created by duplicating memory. I haven't looked at how easy it would be to bring that flexibility back.
Any ideas how we can solve this?
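The concern can be shown with a toy chunk type. If every task's chunk is a plain memory copy of one template, any pointer stored in the template ends up shared by all tasks. `ToyChunk` and its helpers are hypothetical; allocating lazily inside the task and releasing in a matching free step is one possible workaround, not something this patch implements:

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

/* Hypothetical per-task chunk, duplicated by a byte copy of a template. */
struct ToyChunk {
  int *scratch;    /* Per-task buffer: must NOT be set in the template. */
  int scratch_len;
};

/* Simulates how a task receives its chunk: a memcpy of the template. */
void toy_chunk_duplicate(const ToyChunk *tmpl, ToyChunk *r_chunk)
{
  memcpy(r_chunk, tmpl, sizeof(*r_chunk));
}

/* Workaround sketch: keep the pointer NULL in the template and allocate
 * the buffer the first time the task needs it. */
int *toy_chunk_scratch_ensure(ToyChunk *chunk, int len)
{
  if (chunk->scratch == nullptr) {
    chunk->scratch = static_cast<int *>(calloc(len, sizeof(int)));
    assert(chunk->scratch != nullptr);
    chunk->scratch_len = len;
  }
  assert(chunk->scratch_len == len);
  return chunk->scratch;
}

/* Matching "free" step, run once per chunk when its task is done. */
void toy_chunk_free(ToyChunk *chunk)
{
  free(chunk->scratch);
  chunk->scratch = nullptr;
  chunk->scratch_len = 0;
}
```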

source/blender/gpu/intern/gpu_index_buffer.cc, line 82:

GPU_indexbuf_join would be enough.
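One way to read this suggestion, sketched as a toy under stated assumptions: if each sub-builder is a copy of the parent that writes indices into the shared array at positions fixed by the element index, no slot is written twice, and joining only has to merge the bookkeeping (length and min/max index). `ToyIndexBuilder`, `toy_join` and the other names are hypothetical and do not reflect the real `GPUIndexBufBuilder` layout or the `GPU_indexbuf_join` signature:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

/* Toy stand-in for an index buffer builder (NOT Blender's GPUIndexBufBuilder).
 * Assumption: sub-builders share one preallocated array and write each element
 * to slots fixed by its index, so no two threads write the same slot. */
struct ToyIndexBuilder {
  uint32_t *data;     /* Shared output array. */
  uint32_t index_len; /* One past the highest slot written. */
  uint32_t index_min;
  uint32_t index_max;
};

/* A sub-builder is a plain copy of the parent: same data pointer. */
ToyIndexBuilder toy_subbuilder_init(const ToyIndexBuilder &parent)
{
  return parent;
}

/* Write triangle `tri` into its own three slots of the shared array. */
void toy_set_tri(ToyIndexBuilder &b, uint32_t tri, uint32_t v1, uint32_t v2, uint32_t v3)
{
  const uint32_t base = tri * 3;
  b.data[base + 0] = v1;
  b.data[base + 1] = v2;
  b.data[base + 2] = v3;
  b.index_len = std::max(b.index_len, base + 3);
  b.index_min = std::min({b.index_min, v1, v2, v3});
  b.index_max = std::max({b.index_max, v1, v2, v3});
}

/* Join: the indices are already in place, so only bookkeeping is merged. */
void toy_join(ToyIndexBuilder &to, const ToyIndexBuilder &from)
{
  to.index_len = std::max(to.index_len, from.index_len);
  to.index_min = std::min(to.index_min, from.index_min);
  to.index_max = std::max(to.index_max, from.index_max);
}

/* Demo: build `num_tris` triangles (tri t uses verts t, t+1, t+2) on two threads. */
ToyIndexBuilder toy_build_two_threads(std::vector<uint32_t> &storage, uint32_t num_tris)
{
  storage.assign(static_cast<size_t>(num_tris) * 3, 0u);
  ToyIndexBuilder parent = {storage.data(), 0, UINT32_MAX, 0};
  ToyIndexBuilder sub_a = toy_subbuilder_init(parent);
  ToyIndexBuilder sub_b = toy_subbuilder_init(parent);
  const uint32_t half = num_tris / 2;
  std::thread ta([&]() {
    for (uint32_t t = 0; t < half; t++) {
      toy_set_tri(sub_a, t, t, t + 1, t + 2);
    }
  });
  std::thread tb([&]() {
    for (uint32_t t = half; t < num_tris; t++) {
      toy_set_tri(sub_b, t, t, t + 1, t + 2);
    }
  });
  ta.join();
  tb.join();
  /* Joins run on one thread after the workers have finished. */
  toy_join(parent, sub_a);
  toy_join(parent, sub_b);
  return parent;
}
```

Since each join only reads finished sub-builders and writes the parent from a single thread, this avoids having sub-builders update the parent concurrently, which is presumably why joining was preferred over finishing sub-builders from the free callback.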

  • Fix tests and cleanup (rename functions)
  • Cleanup: rename task_finish --> task_reduce

The `*(uint32_t **)` cast is the only thing I don't like style-wise, but we can clean that up in master.

source/blender/draw/intern/draw_cache_extract_mesh_private.h, line 197:

Can be removed.

This revision is now accepted and ready to land. Jun 11 2021, 12:26 PM