Page MenuHome

T94310: Making sure we always use the right number of logical processors
ClosedPublic

Authored by Jagannadhan Ravi (easythrees) on Dec 23 2021, 5:11 AM.

Details

Summary
NOTE: This is a fix for https://developer.blender.org/T94310

On Windows 11 with a 3990X (with Hyperthreading enabled), if you render with Cycles, you'll see in the task manager only half the threads get used. A 3990X has 64 physical cores and 128 logical ones when hyperthreading is turned on. Windows 10 doesn't show this problem. per Microsoft, the issue seems to be a change in the NUMA API: https://docs.microsoft.com/en-us/windows/win32/procthread/numa-support#behavior-starting-with-windows-10-build-20348; since Blender uses TBB, updating TBB will probably address this issue (note that the Arnold renderer uses TBB v2020.3 and doesn't have this issue on Windows 11).

Exact steps for others to reproduce the error

Make sure you have a 3990X or better with hyperthreading turned on
Open Blender
Open any scene (I used the barbershop from here: https://www.blender.org/download/demo-files/)
Open the Task Manager, make sure it's on top
Render the scene, and you should see half the logical cores being used.

Fix: Change the API call we use to get the number of active cores/threads.

Diff Detail

Repository
rB Blender

Event Timeline

Jagannadhan Ravi (easythrees) requested review of this revision.Dec 23 2021, 5:11 AM
Jagannadhan Ravi (easythrees) created this revision.

For the future, please don't mix functional with non functional (white space) changes in one patch. This way it's easier to review.

Also the patch description is a little bare, a description of what the problem is and how this patch fixes it be nice, for ideas see ingredients of a patch

Can this be simplified to std::thread::hardware_concurrency ?

Can this be simplified to std::thread::hardware_concurrency ?

Nope. I tried that, but it doesn't work in Windows 11.

Nope. I tried that, but it doesn't work in Windows 11.

Bummer.

Did you try using TBB Update3 in Blender to see if that solves the issue? It would be really nice to find an easier solution to this issue (or, at least, investigate that there are no easier ones :)

BTW, the BLI_system_thread_count also requires fixing by the sounds of it.

Nope. I tried that, but it doesn't work in Windows 11.

Bummer.

Did you try using TBB Update3 in Blender to see if that solves the issue? It would be really nice to find an easier solution to this issue (or, at least, investigate that there are no easier ones :)

BTW, the BLI_system_thread_count also requires fixing by the sounds of it.

Yeah, that was the first thing @Ray Molenkamp (LazyDodo) and I tried which also didn’t work. I thought that Cycles used TBB, but it looks like CyclesX uses this API.

I can take a look at the other method to fix it, but that should be its own change I think. Best to keep things modular.

Cycles does use TBB for threading and scheduling. This is more of not-so-expected reemergence of a code which I've hoped could be simplified away. But ok, lets fix the numaAPI for now.

For this, can you please remove all non-functional changes form the patch? Seems you've run auto-formatter which introduced a huge amount of unrelated changes.

What I also don't understand is why iterating over all NUMA nodes and counting their available CPUs doesn't give the number we are looking for. Sounds like a but might be in there. In any case, we should not have platform specific #ifdef around numaAPI calls as the library is supposed to take care of all platform specific decisions.

Does using tbb::this_task_arena::max_concurrency() instead work?

diff --git a/intern/cycles/device/device.cpp b/intern/cycles/device/device.cpp
index 2b067d5..d3de6c5 100644
--- a/intern/cycles/device/device.cpp
+++ b/intern/cycles/device/device.cpp
@@ -333,7 +334,7 @@ DeviceInfo Device::get_multi_device(const vector<DeviceInfo> &subdevices,
     /* Ensure CPU device does not slow down GPU. */
     if (device.type == DEVICE_CPU && subdevices.size() > 1) {
       if (background) {
-        int orig_cpu_threads = (threads) ? threads : system_cpu_thread_count();
+        int orig_cpu_threads = (threads) ? threads : TaskScheduler::num_threads();
         int cpu_threads = max(orig_cpu_threads - (subdevices.size() - 1), 0);

         VLOG(1) << "CPU render threads reduced from " << orig_cpu_threads << " to " << cpu_threads
diff --git a/intern/cycles/util/task.cpp b/intern/cycles/util/task.cpp
index ce61bf8..eeccbaf 100644
--- a/intern/cycles/util/task.cpp
+++ b/intern/cycles/util/task.cpp
@@ -89,7 +89,7 @@ void TaskScheduler::init(int num_threads)
     active_num_threads = num_threads;
   }
   else {
-    active_num_threads = system_cpu_thread_count();
+    active_num_threads = tbb::this_task_arena::max_concurrency();
   }
 }

We already rely on TBB to create the threads based on its own count. And if your fix works, then that implies our current TBB version is counting the cores correctly.

And then we could get rid of numaapi in Cycles.

Does using tbb::this_task_arena::max_concurrency() instead work?

diff --git a/intern/cycles/device/device.cpp b/intern/cycles/device/device.cpp
index 2b067d5..d3de6c5 100644
--- a/intern/cycles/device/device.cpp
+++ b/intern/cycles/device/device.cpp
@@ -333,7 +334,7 @@ DeviceInfo Device::get_multi_device(const vector<DeviceInfo> &subdevices,
     /* Ensure CPU device does not slow down GPU. */
     if (device.type == DEVICE_CPU && subdevices.size() > 1) {
       if (background) {
-        int orig_cpu_threads = (threads) ? threads : system_cpu_thread_count();
+        int orig_cpu_threads = (threads) ? threads : TaskScheduler::num_threads();
         int cpu_threads = max(orig_cpu_threads - (subdevices.size() - 1), 0);

         VLOG(1) << "CPU render threads reduced from " << orig_cpu_threads << " to " << cpu_threads
diff --git a/intern/cycles/util/task.cpp b/intern/cycles/util/task.cpp
index ce61bf8..eeccbaf 100644
--- a/intern/cycles/util/task.cpp
+++ b/intern/cycles/util/task.cpp
@@ -89,7 +89,7 @@ void TaskScheduler::init(int num_threads)
     active_num_threads = num_threads;
   }
   else {
-    active_num_threads = system_cpu_thread_count();
+    active_num_threads = tbb::this_task_arena::max_concurrency();
   }
 }

We already rely on TBB to create the threads based on its own count. And if your fix works, then that implies our current TBB version is counting the cores correctly.

And then we could get rid of numaapi in Cycles.

That seems to work. I'll redo the change. I'm not sure why there's all this extra spacing stuff in the change, I even ran a make format before building the change.

Great.

I think we could make a similar change on the Blender side, use tbb::this_task_arena::max_concurrency() in BLI_task_scheduler_init(). And then replace BLI_system_thread_count() with BLI_task_scheduler_num_threads() (probably everywhere but needs to be checked if it makes sense).

Great.

I think we could make a similar change on the Blender side, use tbb::this_task_arena::max_concurrency() in BLI_task_scheduler_init(). And then replace BLI_system_thread_count() with BLI_task_scheduler_num_threads() (probably everywhere but needs to be checked if it makes sense).

I'll do this up in a separate change once we get this one through. Will update this in a bit, need to attend to another issue first.

Question, should I update this review with the new code change, or should I make a new one (since it's completely different files)? @Ray Molenkamp (LazyDodo) ?

Updated approach using TBB to determine max concurrency.

Should fix formatting issues (thanks @Ray Molenkamp (LazyDodo)!)

This is very nice to see such simplified solution!

This revision is now accepted and ready to land.Jan 7 2022, 11:38 AM

Thanks for the patch, i've committed it now. Good work!

@Jagannadhan Ravi (easythrees) if you can submit similar change for the Blender side that'd be lovely!