Description
NOTE: Not a complaint, just an observation that might be useful for discussing / designing / improving the scheduling model.
Hi, I ran our benchmarks on 1.6.2 and on current master (after #836 was merged), and I noticed a rather significant change in performance characteristics.
The following two tables show the throughput (in MB/s) of the benchmark at different numbers of concurrent tasks and on different CPU setups.
The benchmarks were run on a Ryzen Threadripper 24-core (48-thread) CPU. It's worth noting that this CPU houses (simplified) 8 CCXs (3 cores / 6 threads each); cores within a CCX share their cache, while cores on different CCXs communicate over an internal bus (somewhat acting as different NUMA domains).
So in the 3-core test, the three cores are specifically picked to be on the same CCX, meaning they share their cache; in the 48-core test all cores (or threads) are used, meaning no caches are shared by default.
It looks like in 1.6.2, related tasks were rather likely to be scheduled on the same executor (or an executor on the same CCX), so the 3 steps in the benchmark, deserialize
-> process
-> serialize
were more often executed on the same CCX (leading to higher performance in the 48-thread case), but overall it was less performant (in the 3-core case).
master switches this around: it seems to schedule related tasks on new cores more often than not, leading to (sometimes much) lower performance in the 48-core case but much better performance on the limited core set with the shared cache.
We observed the same with threads (to an even stronger degree): threads outperformed tasks on the 3-core setup, but unless pinned they would be moved around so much that the loss of cache locality would obliterate performance.
It seems that master is very friendly to UMA systems (or generally systems where all cores share a cache) but less friendly to NUMA systems that don't have that luxury (counting the Ryzen platform here even though it's a bit of a special case).
**1.6.2 (throughput in MB/s)**

| # tasks/2 | 3 cores | 48 cores |
|---|---|---|
| 1 | 181.7 | 84.9 |
| 2 | 276.1 | 337.3 |
| 4 | 291.5 | 333.3 |
| 8 | 260.9 | 325.9 |
| 16 | 254.7 | 320.2 |
| 32 | 250.3 | 302.8 |
| 64 | 243.2 | 305.8 |
**master (throughput in MB/s)**

| # tasks/2 | 3 cores | 48 cores |
|---|---|---|
| 1 | 273.8 | 128.6 |
| 2 | 339.1 | 292.3 |
| 4 | 342.7 | 234.2 |
| 8 | 308.3 | 209.9 |
| 16 | 265.6 | 210.9 |
| 32 | 245.9 | 201.2 |
| 64 | 229.8 | 197.4 |