I am working on a presentation about writing code for better execution performance, and I have chosen lattice_deform as the testing ground for it.
This patch is the result of several experiments to increase the execution performance of lattice deformation.
- Adds test cases to compare the result with the old implementation. The tests differ in the number of verts to transform and in the batch size.
- The old implementation calculated one vert at a time and released static data that could have been shared with other verts.
- Uses branchless code tricks to minimize the number of branches.
- Uses a phased approach to reduce inner-loop complexity.
- Uses batching to reduce the demand on the memory cache.
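The phased, batched and branchless ideas in the list above can be sketched roughly as follows. This is not the code from the patch: it assumes a simplified lattice over the unit cube and swaps in plain trilinear interpolation for the real lattice key types. `Float3`, `Lattice`, `clamp_index`, `axis_weight`, `kBatchSize` and `lattice_deform_batched` are hypothetical names chosen for illustration.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Float3 {
  float x, y, z;
};

/* Hypothetical, simplified lattice: res^3 displacement vectors over the unit cube. */
struct Lattice {
  int res;                     /* Control points per axis, must be >= 2. */
  std::vector<Float3> offsets; /* res * res * res entries, x index varies fastest. */
  const Float3 &at(int i, int j, int k) const
  {
    return offsets[std::size_t((k * res + j) * res + i)];
  }
};

/* Clamp without an explicit if; min/max usually compile to select instructions. */
static inline int clamp_index(int v, int hi)
{
  return std::max(0, std::min(v, hi));
}

/* Returns f when bit == 1 and (1 - f) when bit == 0, using arithmetic only. */
static inline float axis_weight(int bit, float f)
{
  return float(1 - bit) + float(2 * bit - 1) * f;
}

constexpr std::size_t kBatchSize = 1000; /* Assumed value, meant to be tuned. */

/* Assumes finite vertex coordinates of moderate magnitude. */
void lattice_deform_batched(const Lattice &lat, std::vector<Float3> &verts)
{
  assert(lat.res >= 2);
  /* Small scratch buffers that are reused for every batch. */
  std::array<int, kBatchSize * 3> cell;
  std::array<float, kBatchSize * 3> frac;
  const int last_cell = lat.res - 2;
  const float scale = float(lat.res - 1);

  for (std::size_t start = 0; start < verts.size(); start += kBatchSize) {
    const std::size_t count = std::min(kBatchSize, verts.size() - start);

    /* Phase 1: cell index and fractional position for every vert in the batch. */
    for (std::size_t n = 0; n < count; n++) {
      const Float3 &v = verts[start + n];
      const float p[3] = {v.x * scale, v.y * scale, v.z * scale};
      for (int a = 0; a < 3; a++) {
        const int c = clamp_index(int(std::floor(p[a])), last_cell);
        cell[n * 3 + a] = c;
        frac[n * 3 + a] = std::max(0.0f, std::min(p[a] - float(c), 1.0f));
      }
    }

    /* Phase 2: blend the 8 surrounding control points and move the vert. */
    for (std::size_t n = 0; n < count; n++) {
      const int *c = &cell[n * 3];
      const float *f = &frac[n * 3];
      Float3 d = {0.0f, 0.0f, 0.0f};
      for (int corner = 0; corner < 8; corner++) {
        const int dx = corner & 1, dy = (corner >> 1) & 1, dz = corner >> 2;
        const float w = axis_weight(dx, f[0]) * axis_weight(dy, f[1]) * axis_weight(dz, f[2]);
        const Float3 &o = lat.at(c[0] + dx, c[1] + dy, c[2] + dz);
        d.x += w * o.x;
        d.y += w * o.y;
        d.z += w * o.z;
      }
      Float3 &v = verts[start + n];
      v.x += d.x;
      v.y += d.y;
      v.z += d.z;
    }
  }
}
```

Splitting the work into two short loops keeps each inner loop simple, and the fixed-size scratch buffers are small enough to stay in cache while a batch is processed. Verts outside the unit cube are clamped to the lattice border in this sketch, which may differ from the real behavior.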
Old implementation:
[ RUN ] lattice_deform_performance.performance_no_dvert_1
[ OK ] lattice_deform_performance.performance_no_dvert_1 (0 ms)
[ RUN ] lattice_deform_performance.performance_no_dvert_1000
[ OK ] lattice_deform_performance.performance_no_dvert_1000 (0 ms)
[ RUN ] lattice_deform_performance.performance_no_dvert_10000
[ OK ] lattice_deform_performance.performance_no_dvert_10000 (4 ms)
[ RUN ] lattice_deform_performance.performance_no_dvert_100000
[ OK ] lattice_deform_performance.performance_no_dvert_100000 (32 ms)
[ RUN ] lattice_deform_performance.performance_no_dvert_1000000
[ OK ] lattice_deform_performance.performance_no_dvert_1000000 (319 ms)
[ RUN ] lattice_deform_performance.performance_no_dvert_10000000
[ OK ] lattice_deform_performance.performance_no_dvert_10000000 (3167 ms)
[ RUN ] lattice_deform_performance.performance_no_dvert_10000000_batch1
[ OK ] lattice_deform_performance.performance_no_dvert_10000000_batch1 (3197 ms)
[ RUN ] lattice_deform_performance.performance_no_dvert_10000000_batch10
[ OK ] lattice_deform_performance.performance_no_dvert_10000000_batch10 (3206 ms)
[ RUN ] lattice_deform_performance.performance_no_dvert_10000000_batch100
[ OK ] lattice_deform_performance.performance_no_dvert_10000000_batch100 (3200 ms)
[ RUN ] lattice_deform_performance.performance_no_dvert_10000000_batch1000
[ OK ] lattice_deform_performance.performance_no_dvert_10000000_batch1000 (3202 ms)
[ RUN ] lattice_deform_performance.performance_no_dvert_10000000_batch10000
[ OK ] lattice_deform_performance.performance_no_dvert_10000000_batch10000 (3182 ms)
[ RUN ] lattice_deform_performance.performance_no_dvert_10000000_batch100000
[ OK ] lattice_deform_performance.performance_no_dvert_10000000_batch100000 (3165 ms)
[ RUN ] lattice_deform_performance.performance_no_dvert_10000000_batch1000000
[ OK ] lattice_deform_performance.performance_no_dvert_10000000_batch1000000 (3153 ms)
[ RUN ] lattice_deform_performance.performance_no_dvert_10000000_batch10000000
[ OK ] lattice_deform_performance.performance_no_dvert_10000000_batch10000000 (3204 ms)
Current progress:
[ RUN ] lattice_deform_performance.performance_no_dvert_1
[ OK ] lattice_deform_performance.performance_no_dvert_1 (0 ms)
[ RUN ] lattice_deform_performance.performance_no_dvert_1000
[ OK ] lattice_deform_performance.performance_no_dvert_1000 (1 ms)
[ RUN ] lattice_deform_performance.performance_no_dvert_10000
[ OK ] lattice_deform_performance.performance_no_dvert_10000 (3 ms)
[ RUN ] lattice_deform_performance.performance_no_dvert_100000
[ OK ] lattice_deform_performance.performance_no_dvert_100000 (24 ms)
[ RUN ] lattice_deform_performance.performance_no_dvert_1000000
[ OK ] lattice_deform_performance.performance_no_dvert_1000000 (199 ms)
[ RUN ] lattice_deform_performance.performance_no_dvert_10000000
[ OK ] lattice_deform_performance.performance_no_dvert_10000000 (1959 ms)
[ RUN ] lattice_deform_performance.performance_no_dvert_10000000_batch1
[ OK ] lattice_deform_performance.performance_no_dvert_10000000_batch1 (1788 ms)
[ RUN ] lattice_deform_performance.performance_no_dvert_10000000_batch10
[ OK ] lattice_deform_performance.performance_no_dvert_10000000_batch10 (1768 ms)
[ RUN ] lattice_deform_performance.performance_no_dvert_10000000_batch100
[ OK ] lattice_deform_performance.performance_no_dvert_10000000_batch100 (1732 ms)
[ RUN ] lattice_deform_performance.performance_no_dvert_10000000_batch1000
[ OK ] lattice_deform_performance.performance_no_dvert_10000000_batch1000 (1692 ms)
[ RUN ] lattice_deform_performance.performance_no_dvert_10000000_batch10000
[ OK ] lattice_deform_performance.performance_no_dvert_10000000_batch10000 (1737 ms)
[ RUN ] lattice_deform_performance.performance_no_dvert_10000000_batch100000
[ OK ] lattice_deform_performance.performance_no_dvert_10000000_batch100000 (1767 ms)
[ RUN ] lattice_deform_performance.performance_no_dvert_10000000_batch1000000
[ OK ] lattice_deform_performance.performance_no_dvert_10000000_batch1000000 (1808 ms)
[ RUN ] lattice_deform_performance.performance_no_dvert_10000000_batch10000000
[ OK ] lattice_deform_performance.performance_no_dvert_10000000_batch10000000 (1949 ms)
NOTE: To be more useful, the weight should be made per vertex. Instead of using a vec3 we can use a vec4 where the 4th element is the weight (including the dvert weight of the target). For this we might need to add more, smaller structs so that data isn't scattered too much around in memory.
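One possible shape for the vec4 idea in the note above is sketched here. It is an assumption about how such a layout could look, not something the patch contains: `Float4`, `pack_verts` and `apply_displacement` are hypothetical names, and an empty `dvert_weights` vector stands for "no vertex group".

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

/* Hypothetical packed element: position in x/y/z, per-vertex weight in w. */
struct Float4 {
  float x, y, z, w;
};
static_assert(sizeof(Float4) == 16, "position and weight should sit in one 16-byte element");

/* Copy positions into the packed layout and fold the global influence and the
 * optional per-vertex (dvert) weight into the 4th component. */
std::vector<Float4> pack_verts(const std::vector<std::array<float, 3>> &co,
                               const std::vector<float> &dvert_weights,
                               float influence)
{
  const bool has_dvert = !dvert_weights.empty();
  assert(!has_dvert || dvert_weights.size() == co.size());
  std::vector<Float4> packed(co.size());
  for (std::size_t i = 0; i < co.size(); i++) {
    const float dvert_weight = has_dvert ? dvert_weights[i] : 1.0f;
    packed[i] = Float4{co[i][0], co[i][1], co[i][2], influence * dvert_weight};
  }
  return packed;
}

/* The deform step then reads position and weight from the same element. */
inline void apply_displacement(Float4 &v, float dx, float dy, float dz)
{
  v.x += v.w * dx;
  v.y += v.w * dy;
  v.z += v.w * dz;
}
```

With this layout the inner loop touches a single contiguous array, so the weight lookup does not cause an extra access to a separate dvert structure.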
NOTE: This is a PoC; it needs more work before it actually benefits the user. This patch is part of a presentation about writing code that runs faster.