Why does multi-threaded code run slower on faster machines?


Consider the following C++ code:

#include "threadpool.hpp" #include <chrono> #include <list> #include <iostream> #include <cmath>  int loop_size;  void process(int num) {     double x = 0;     double sum = 0;     for(int i = 0; i < loop_size; ++i) {         x += 0.0001;         sum += sin(x) / cos(x) + cos(x) * cos(x);     } }  int main(int argc, char* argv[]) {     if(argc < 3) {         std::cerr << argv[0] << " [thread_pool_size] [threads] [sleep_time]" << std::endl;         exit(0);     }     thread_pool* pool = nullptr;     int th_count = std::atoi(argv[1]);     if(th_count != 0) {         pool = new thread_pool(th_count);     }     loop_size = std::stoi(argv[3]);     int max = std::stoi(argv[2]);     auto then = std::chrono::steady_clock::now();     std::list<std::thread> ths;     if(th_count == 0) {         for(int i = 0; i < max; ++i) {             ths.emplace_back(&process, i);         }         for(std::thread& t : ths) {             t.join();         }     } else {         for(int i = 0; i < max; ++i) {             pool->enqueue(std::bind(&process, i));         }         delete pool;     }     int diff = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - then).count();     std::cerr << "Time: " << diff << '/n';     return 0; } 

And "threadpool.hpp" is modified version of this github repo and it is available here

I compiled the above code on my machine (Core i7-6700) and on an 88-core server (2x Xeon E5-2696 v4), and I got results I can't explain.

This is how I run the code:

tp <threadpool size> <number of threads> <iterations> 

The very same code runs slower on the faster machine! I have 8 cores on my local machine and 88 cores on the remote server, and these are the results (the last two columns show the average time to complete, in milliseconds, on each machine):

+============+=========+============+=============+====================+
| Threadpool | Threads | Iterations | Corei7-6700 | 2x Xeon E5-2696 v4 |
+============+=========+============+=============+====================+
|        100 |  100000 |       1000 |        1300 |               6000 |
+------------+---------+------------+-------------+--------------------+
|       1000 |  100000 |       1000 |        1400 |               5000 |
+------------+---------+------------+-------------+--------------------+
|      10000 |  100000 |       1000 |        1470 |               3400 |
+------------+---------+------------+-------------+--------------------+

It seems that having more cores makes the code run slower. So I restricted the CPU affinity on the server to 8 cores (with taskset) and ran the code again:

taskset -c 0-7 tp <threadpool size> <number of threads> <iterations> 

This is the new data:

+============+=========+============+=============+====================+
| Threadpool | Threads | Iterations | Corei7-6700 | 2x Xeon E5-2696 v4 |
+============+=========+============+=============+====================+
|        100 |  100000 |       1000 |        1300 |                900 |
+------------+---------+------------+-------------+--------------------+
|       1000 |  100000 |       1000 |        1400 |               1000 |
+------------+---------+------------+-------------+--------------------+
|      10000 |  100000 |       1000 |        1470 |               1070 |
+------------+---------+------------+-------------+--------------------+

I have tested the same code on a 32-core Xeon and on an older 22-core Xeon machine, and the pattern is similar: having fewer cores makes the multi-threaded code run faster. But why?

IMPORTANT NOTE: This is an effort to solve my original problem here:

Why having more and faster cores makes my multithreaded software slower?

Notes:

  1. The operating system and compiler are the same on all machines: Debian 9.0 amd64 running kernel 4.9.0-3, with g++ 6.3.0 20170516
  2. No additional flags, default optimization level: g++ ./threadpool.cpp -o ./tp -lpthread

 


In general, for CPU-bound code like this, you shouldn't expect any benefit from running more threads in your pool than you have cores to execute them.

For example, comparing pools with 1, 2, ... N/2 ... N ... N*2 threads for an N-core socket might be interesting. A pool with 10*N threads is really just testing how the scheduler behaves under load.
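A quick way to run such a sweep with the tp binary above (a sketch; the pool sizes are illustrative and the timing is printed to stderr):

for n in 1 2 4 8 16 32 80; do
    echo "pool size: $n"
    ./tp $n 100000 1000
done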

Then, also in general, you need some idea of the per-task overhead: the more tasks you split your work into, the more time is spent creating, destroying, and synchronizing access to those tasks. Varying the sub-task size for a fixed amount of work is a good way to see this.
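With the same binary you can hold the total work (threads x iterations) fixed while changing how finely it is split; the task counts below are illustrative, but the trend should show how much time goes to per-task overhead:

./tp 8 100 1000000      # 100 large tasks
./tp 8 10000 10000      # 10,000 medium tasks
./tp 8 1000000 100      # 1,000,000 tiny tasks, enqueue/dispatch-dominated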

Finally, it helps to know something about the physical architecture you're using. The NUMA server platform can do exactly twice as much work with its two sockets as the same single CPU could do alone - if each socket accesses only its own directly-attached memory. As soon as you're transferring data across the QPI, performance degrades. Bouncing a heavily-contended cacheline like your mutex across the QPI can slow the whole thing right down.
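If numactl is available on the server (an assumption; the question doesn't say), you can confine both the worker threads and their memory to one socket and compare against the unconstrained run:

numactl --cpunodebind=0 --membind=0 tp 100 100000 1000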

Similarly, if you have N cores and want to run N threads in your pool - do you know if they're physical cores, or hyperthreaded logical cores? If they're HT, do you know if your threads will be able to run full-speed, or will they contend for limited shared resources?
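On Linux, lscpu will tell you (field names may vary slightly between versions); if "Thread(s) per core" is 2, the 88 CPUs the kernel reports on the server are really 44 physical cores, each running two hyperthreads:

lscpu | grep -E 'Thread|Core|Socket'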
