Page MenuHome

Refactor: DRW Mesh Extractor: Join the extractors in a same loop
ClosedPublic

Authored by Germano Cavalcante (mano-wii) on May 24 2021, 2:48 PM.

Details

Summary

This patch replaces / redoes the entire MeshExtractors system.
Although they were useful and facilitated the addition of new buffers, they made it difficult to control the threads and added a lot of threading overhead.

Part of the problem was in traversing the same loop type in different threads. The concurrent access of the BMesh Elements slowed the reading.

This patch simplifies the use of threads by merging all the old callbacks from the extracts into a single series of iteration functions.

The type of extraction can be chosen using flags.

This optimized the process by around 34%.


System Information
CPU: AMD Ryzen 7 1800X Eight-Core Processor
Operating system: Windows-10-10.0.19041-SP0 64 Bits
Graphics card: Radeon (TM) RX 480 Graphics ATI Technologies Inc. 4.5.14760 Core Profile Context 20.45.37.01 27.20.14537.1001

Diff Detail

Repository
rB Blender
Branch
arcpatch-D11375_1 (branched from master)
Build Status
Buildable 14796
Build 14796: arc lint + arc unit

Event Timeline

Germano Cavalcante (mano-wii) requested review of this revision.May 24 2021, 2:48 PM
Germano Cavalcante (mano-wii) created this revision.

The fact I used a copy of MeshBufferCache mbc, is to be able to set the pointers to null for non requested buffers (like already filled ones). See DRW_TEST_ASSIGN_VBO & DRW_TEST_ASSIGN_IBO.

About passing it to all callbacks, I don't know. It's less clear what the callbacks should be working with to avoid race condition. One could think you can access the whole mbc in a thread safe manner.

Although I do agree the void * are less than ideal.

  • Do not change MeshBufferCache values;

It's less clear what the callbacks should be working with to avoid race condition. One could think you can access the whole mbc in a thread safe manner.

In fact, this is less clear.
Passing a simple void *buf implies that the extract is an individual operation that only writes to that specific buffer. Thus, the pointer is entirely in charge of mesh_buffer_cache_create_requested.

If it is a mandatory design, I will think of an alternative solution. But something I noticed is that, if we join the different callbacks in the same respective loop (looptri, poly, ledges, lverts) it can result in a significant improvement in performance (Something around 34% faster).
So it might be a good idea to rethink the idea of keeping these extracts independent.

It's less clear what the callbacks should be working with to avoid race condition. One could think you can access the whole mbc in a thread safe manner.

In fact, this is less clear.
Passing a simple void *buf implies that the extract is an individual operation that only writes to that specific buffer. Thus, the pointer is entirely in charge of mesh_buffer_cache_create_requested.

Well isn't it the case?

If it is a mandatory design, I will think of an alternative solution. But something I noticed is that, if we join the different callbacks in the same respective loop (looptri, poly, ledges, lverts) it can result in a significant improvement in performance (Something around 34% faster).
So it might be a good idea to rethink the idea of keeping these extracts independent.

So you mean having a general loop and have the callbacks inside the loop?

Well isn't it the case?

Yes, I was just confirming!

So you mean having a general loop and have the callbacks inside the loop?

That is the idea.
I made a test patch for profiling this, it is not presentable, but I could see the benefit.
The file I am testing (high poly mesh in edit mode with several overlays), performance went from 53 to 36ms with room for improvement:

To make the patch presentable I am looking for an alternative solution (perhaps replacing all extracts since they can no longer be independent).

Ah I see. So eventually a future patch will improve data localization with the effect of more branch prediction misses.

In pseudo:

for batch (a scheduled task) {
  per loop type
     for multithreaded_requested_extractors {
        multithreaded_requested_extractors.loop type function()
     }
}

There are a lot of possibilities to increase the performance here. Like grouping and ordering. I assume you are referring to this when talking about room for improvements?

An idea we should discuss is to have a template class for this.

template<typename Buffer, typename Data=void> class MeshBatchExtractor {
    virtual Buffer* get_buffer(MeshBufferCache& mbc) = 0;
    virtual void finish(const MeshRenderData *mr,
                        MeshBatchCache *cache,
                        Buffer *ibo,
                        Data *data) = 0;

};

class MyExtractor: public MeshBatchExtractor<GPUIndexBuffer, void> {
    GPUIndexBuffer* get_buffer(MeshBufferCache& mbc) override {
        /* returns a reference of an indexbuffer in MBC, fe `&mbc.ibo.tris`;
         * this reference will be passed to the callback functions. */
        return nullptr;
    }

    void finish(const MeshRenderData *mr,
                MeshBatchCache *cache,
                GPUIndexBuffer *ibo,
                void *data) override {}
};
Germano Cavalcante (mano-wii) planned changes to this revision.May 25 2021, 3:06 PM
  • Replace all extracts with flags that are used to identify the type of extraction in single iter functions

I ended up having to redo a large part of the code.
Also forgot the DEBUG_TIME of this patch is uncommented (can be useful to check the results)
Now all extracts are joined in a single loop.
In the attached file, this resulted in a change from 53 to 35ms.

Germano Cavalcante (mano-wii) retitled this revision from Cleanup/Refactor: DRW: pass the 'MeshBufferCache' as a parameter instead of a GPU Buffer for the init and finish callbacks of the extracts to Refactor: DRW Mesh Extractor: Join the extractors in a same loop.May 26 2021, 1:04 AM
Germano Cavalcante (mano-wii) edited the summary of this revision. (Show Details)
  • Remove node_shader_curves.c (not part of the patch)

This patch optimize ram access before branch prediction and still leave room to optimize further. I think this patch is something we should strife for, great idea and initial patch! Although I have some concerns on readability and maintainability.

The part of the code that I don't like is that you need changes in many areas when adding or changing extractions. This adds unclarity to other devs. I would rather see an approach where an extraction is configured centrally but still can be used in this mechanism. As this file is getting big this we should we should be able to place all the relevant pieces of an extraction in separate compile units. I don't expect the compilation unit separation part of this patch, but would expect a solution that doesn't block us in doing this.

Idea: you can increase branch prediction and increase the code centralization by replacing the wall of if statements with a for loop over extractor structs that are enabled for that specific task.

source/blender/draw/intern/draw_cache_extract_mesh.c
82

Use enum type.

124

Unclear what ASYNC means in this context.

6371

Compilation warnings in GCC.

/home/jeroen/blender-git/blender/source/blender/draw/intern/draw_cache_extract_mesh.c:6367:3: note: in expansion of macro ‘EXTRACT_ADD_REQUESTED’
 6367 |   EXTRACT_ADD_REQUESTED(IBO, ibo, edituv_tris);
      |   ^~~~~~~~~~~~~~~~~~~~~
/home/jeroen/blender-git/blender/source/blender/draw/intern/draw_cache_extract_mesh.c:6178:51: note: expected ‘void **’ but argument is of type ‘GPUIndexBuf **’
 6178 |                                            void **buf_ref,
      |                                            ~~~~~~~^~~~~~~
/home/jeroen/blender-git/blender/source/blender/draw/intern/draw_cache_extract_mesh.c:6333:16: warning: passing argument 2 of ‘mesh_render_data_requested’ from incompatible pointer type [-Wincompatible-pointer-types]
 6333 |           mbc, &mbc->type_lowercase.name, do_hq_normals, do_lines_loose_subbuffer); \
      |                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                |
      |                GPUIndexBuf **
 6334 |     } \
      |     ~~~         
 6335 |   } while (0)
      |   ~~~~~~~~~~~
source/blender/draw/intern/draw_cache_impl_mesh.c
1163

This cleanup can be committed to master directly.

Germano Cavalcante (mano-wii) marked 3 inline comments as done.
  • Cleanup: rename variables, rearrange code, supress warnings
  • Simplify the task node that initializes extracts
  • Alloc the main data of the extract only once

It would be nice to convert the macros to enums, but we have very high values like (1ULL << 37).

Germano Cavalcante (mano-wii) marked an inline comment as done and an inline comment as not done.May 27 2021, 5:29 AM
  • Cleanup: rename variables, rearrange code, supress warnings
  • Simplify the task node that initializes extracts
  • Alloc the main data of the extract only once

It would be nice to convert the macros to enums, but we have very high values like (1ULL << 37).

typedef enum eExtractionType: long long {
M_EXTRACT_POS_NOR = (1ULL << 0),
M_EXTRACT_POS_NOR_HQ = (1ULL << 1),
M_EXTRACT_LNOR = (1ULL << 2),
M_EXTRACT_LNOR_HQ = (1ULL << 3),
M_EXTRACT_ORCO = (1ULL << 4),
M_EXTRACT_WEIGHTS = (1ULL << 5),
M_EXTRACT_EDIT_DATA = (1ULL << 6),
M_EXTRACT_EDITUV_DATA = (1ULL << 7),
M_EXTRACT_FDOTS_POS = (1ULL << 8),
M_EXTRACT_FDOTS_UV = (1ULL << 9),
M_EXTRACT_FDOTS_EDITUV_DATA = (1ULL << 10),
M_EXTRACT_POLY_IDX = (1ULL << 11),
M_EXTRACT_EDGE_IDX = (1ULL << 12),
M_EXTRACT_VERT_IDX = (1ULL << 13),
M_EXTRACT_FDOT_IDX = (1ULL << 14),

/* Single Thread. */
M_EXTRACT_TRIS = (1ULL << 15),
M_EXTRACT_LINES = (1ULL << 16),
M_EXTRACT_LINES_WITH_LINES_LOOSE = (1ULL << 17),
M_EXTRACT_POINTS = (1ULL << 18),
M_EXTRACT_FDOTS = (1ULL << 19),
M_EXTRACT_LINES_PAINT_MASK = (1ULL << 20),
M_EXTRACT_LINES_ADJACENCY = (1ULL << 21),
M_EXTRACT_EDITUV_TRIS = (1ULL << 22),
M_EXTRACT_EDITUV_LINES = (1ULL << 23),
M_EXTRACT_EDITUV_POINTS = (1ULL << 24),
M_EXTRACT_EDITUV_FDOTS = (1ULL << 25),
M_EXTRACT_UV = (1ULL << 26),
M_EXTRACT_TAN = (1ULL << 27),
M_EXTRACT_TAN_HQ = (1ULL << 28),
M_EXTRACT_SCULPT_DATA = (1ULL << 29),
M_EXTRACT_VCOL = (1ULL << 30),
M_EXTRACT_EDGE_FAC = (1ULL << 31),
M_EXTRACT_EDITUV_STRETCH_AREA = (1ULL << 32),
M_EXTRACT_EDITUV_STRETCH_ANGLE = (1ULL << 33),
M_EXTRACT_FDOTS_NOR = (1ULL << 34),
M_EXTRACT_FDOTS_NOR_HQ = (1ULL << 35),
M_EXTRACT_MESH_ANALYSIS = (1ULL << 36),
M_EXTRACT_SKIN_ROOTS = (1ULL << 37),
} eExtractionType;

Seems to work in CPP, but I rather use a enum class. So we can just add a comment.
After this patch I want to convert the solution to CPP and do some clean up passes.

In order to make this patch good for master I would like the MeshExtract back in order to make the code more maintainable.
You can make a list of a struct containing a MeshExtract references + the result of the init function and iterate over them.
When converting to CPP we can get the typing back in the functions.

Not for this patch, but to make sure we are on the same page, the goal split for this file would be (scheduling, render mesh data, and a file per extractor type) ideally the extractor would be header only files. On planning I would have done this before you started with this overhaul, but did had time to draft the final solution.

Note that the loop macros may only be called once in this solution and we might want to get rid of them after this patch landed.

source/blender/draw/intern/draw_cache_extract_mesh.c
6150

Use a minimum. otherwise loose geometry and small meshes (cube) would get a batch per vert.

  • bring MeshExtract back
  • ensure range less than chunk
  • Cleanups
source/blender/draw/intern/draw_cache_extract_mesh.c
909–910

would keep the original function names. That would group the functions together in IDEs

1204–1208

Hmmm... note to self (not related to this patch) can this loop have an early exit or does it really need to go over all vertices. It seems like only the last vertice data is stored.

Germano Cavalcante (mano-wii) marked 2 inline comments as done.
  • Restore the original functions name
  • Early exit the fdots loop (solution based on the fdots_pos)

When porfiling, we notice that this patch does not bring improvements to the Mesh type update.
But for BMesh the difference is significant:

large_mesh_editing.blend:
Before: rdata 9ms iter 52ms (frame 169ms)
After: rdata 9ms iter 36ms (frame 152ms)

subdiv_mesh_cage_and_final.blend (DEBUG_TIME returns 3 different results):
Before: [rdata 5ms iter 30ms (frame 195ms)] [rdata 5ms iter 28ms (frame 185ms)] [rdata 5ms iter 32ms (frame 181ms)]
After: [rdata 5ms iter 36ms (frame 202ms)] [rdata 4ms iter 34ms (frame 191ms)] [rdata 5ms iter 38ms (frame 188ms)]

Comparing the timeline of the threads (before top, after bottom), it is possible to notice that the part with multithreaded loop ranges is more efficient.
But for some reason (which still needs to be investigated), the part with simple loops in single thread is less efficient:


Has been committed as part of {44d2479dc36fe3409afc660eea369bea8a517987}

This revision is now accepted and ready to land.May 31 2021, 5:24 PM