This replaces the single-threaded calculation of mesh min and max
positions with a parallel_reduce loop. Since the bounding box
of a mesh is retrieved quite often (at the end of each evaluation,
currently twice(?!) when leaving edit mode, etc.), this makes for a
noticeable speedup.
On my Ryzen 3700X with a 4.2 million vertex mesh, I observed
roughly a 3.2x performance increase, from 14 ms to 4.4 ms.
I added some methods to float3 so that they can be inlined; they're
a nice addition anyway, since these operations are used often.
If this is accepted, I'll look into updating the bounding boxes for some
other object types. Mesh needs a special implementation for now though,
because MVert != float3 ;).
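
For illustration, here is a rough sketch of the idea using only the standard library. It is not the code in this patch: `Float3`, `Vert`, `MinMax`, `minmax_range` and `minmax_parallel` are hypothetical stand-ins, and plain `std::thread` chunking is swapped in for the parallel_reduce loop so the example compiles on its own. Each worker reduces one chunk of vertices to a partial min/max, and the partial results are then merged.

```cpp
#include <algorithm>
#include <cassert>
#include <cfloat>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for float3, with component-wise min/max defined
// in the struct so the compiler can inline them into the hot loop.
struct Float3 {
  float x, y, z;

  static Float3 min(const Float3 &a, const Float3 &b)
  {
    return {std::min(a.x, b.x), std::min(a.y, b.y), std::min(a.z, b.z)};
  }
  static Float3 max(const Float3 &a, const Float3 &b)
  {
    return {std::max(a.x, b.x), std::max(a.y, b.y), std::max(a.z, b.z)};
  }
};

// Hypothetical stand-in for MVert: the position sits next to other per-vertex
// data, so a vertex array cannot be treated as a plain array of Float3.
struct Vert {
  Float3 co;
  int flag;
};

// Identity of the reduction: merging with this value changes nothing.
struct MinMax {
  Float3 min{FLT_MAX, FLT_MAX, FLT_MAX};
  Float3 max{-FLT_MAX, -FLT_MAX, -FLT_MAX};
};

// Serial min/max over the half-open index range [begin, end).
MinMax minmax_range(const std::vector<Vert> &verts, std::size_t begin, std::size_t end)
{
  assert(begin <= end && end <= verts.size());
  MinMax r;
  for (std::size_t i = begin; i < end; i++) {
    r.min = Float3::min(r.min, verts[i].co);
    r.max = Float3::max(r.max, verts[i].co);
  }
  return r;
}

// Split the vertices into one chunk per thread, reduce the chunks
// concurrently, then merge the partial results on the calling thread.
// An empty input returns the identity value unchanged.
MinMax minmax_parallel(const std::vector<Vert> &verts, unsigned num_threads)
{
  num_threads = std::max(1u, num_threads);
  const std::size_t chunk = (verts.size() + num_threads - 1) / num_threads;

  std::vector<MinMax> partial(num_threads);
  std::vector<std::thread> workers;
  for (unsigned t = 0; t < num_threads; t++) {
    const std::size_t begin = std::min(verts.size(), t * chunk);
    const std::size_t end = std::min(verts.size(), begin + chunk);
    // Each worker writes only to its own slot, so no locking is needed.
    workers.emplace_back(
        [&verts, &partial, t, begin, end] { partial[t] = minmax_range(verts, begin, end); });
  }
  for (std::thread &worker : workers) {
    worker.join();
  }

  MinMax result;
  for (const MinMax &p : partial) {
    result.min = Float3::min(result.min, p.min);
    result.max = Float3::max(result.max, p.max);
  }
  return result;
}
```

Calling `minmax_parallel(verts, std::thread::hardware_concurrency())` would use one chunk per hardware thread. A real task scheduler picks grain sizes and reuses threads, so treat the numbers above as belonging to the actual patch, not to this sketch.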